- Context Modeling
- Heuristic #1: Eliminate Contexts
- Heuristic #2: Quantize Context Distributions
- Heuristic #3: Compactly Represent Quantization Points
- Indexing the Contexts
- Final Result
The original idea I had in mind was to write code that somehow compresses short strings. However, upon consideration, two problems presented themselves:
- General-purpose compression algorithms rely on some self-redundancy contained within the data to be compressed. In this case, the data to be compressed is short! Seeing the kind of redundancy required for reasonable compression under such schemes seems rather improbable in short strings. Therefore, some information must be known beforehand. This leads one to a semi-static solution, as used by Shoelacer.
- Semi-static models tend to fall into one of two categories: dictionary and context modeling. Both can achieve "good" performance, but also require "large" amounts of memory to do so. As a result, their utility is rather limited. My goal was to use even smaller models. Small enough to not be a burden on cache. Small enough to store the entire thing, or possibly multiple copies, within cache, and ideally fast enough to exploit this. A solution to this was not so forthcoming.
Initial tests with simple dictionary schemes ruled it out of my selection. Dictionary sizes for reasonable compression performance grew too large for my satisfaction. On the other hand, simple order-1 context modelling had impressive performance, and although naive model storage required far too much space, drastic reductions were easy to imagine. Thus the context modeling scheme was selected, and from there, space savings were engineered (sometimes at the cost of compression performance) until what became Shoelacer.
As with many utilities that output generated code, Shoelacer generates tables, and outputs them along with canned code to achieve the desired effect. Put another way, the actual compression and decompression routines are fixed with little exceptions, but the model passed via an argument to them is pre-computed by Shoelacer for a given set of data.
Shoelacer generates its models based on blended order-2 and order-0 context models. Specifically, the model consists of 512 contexts indexed by a perfect hash, with one byte representing each entry per module containing a checksum of the context bytes, and the index of a quantized distribution. The number of unique distributions typically ranges from 16 to 64. They are represented so that they are convenient for Rice coding with varying power-of-two divisors. If the checksum fails to match the context hashed to it, a default "order-0" distribution is used, which may or may not be used by contexts normally.
Steps Shoelacer performs:
- Accumulates order-2 contexts
- Discards all but the 512 most frequent contexts
- Constructs a perfect hash of the remaining contexts
- Calculates probability distributions per each context
- Quantizes the distributions
- Encodes distributions using a greedy algorithm to minimize space
- Outputs encoded tables and canned C code
An order-n model is constructed by assigning a probability of the next character given the previous n characters. Suppose we've a portion of actual text below, and the character above the carrot symbol is the last decoded symbol.
...ver the lazy do...
Supposing we're dealing with an order-3 context model, and we'd like to predict what the next character is, we consider the last three characters (" th" including a space), and the expected probability of characters which follow. In this case it would seem reasonable to assume that 'e' has the highest probability (perhaps .45), where 'x' has a very low probability (perhaps even 0.0).
Context modeling is very powerful, and has been used in many high-performance compression algorithms (e.g. all the prediction by partial matching variants), but it also has a few problems. The most notable of which is the unseen-context problem. What do you do if you're faced with a context you've never seen before? This is the extreme of another problem: What do you do if you're faced with a context that you've only seen once or a few times before? In both cases, there's too little information acquired to make a good decision. The obvious heuristic would be to encode the symbol literally (that is, as if using an order-(-1) context model). The extension of this is to use a lower order model in which the lower order context has been observed a sufficient number of times. (e.g. look at the last two characters instead of 3) This is called blending.
While even order-1 context modeling can lead to good compression performance, I wanted at least a bit better. The danger is that with higher order context models, blending becomes more necessary, and complexity increases accordingly. Also, with higher orders, more contexts are required, so only a small portion of the 'context space' can be represented. After experimentation, order-2 appeared to be a reasonable compromise.
Heuristic #1: Eliminate Contexts
With order-2 context modeling with byte-sized symbols, 65536 possible contexts exist. This exceeds the goal of a few kilobytes by an order of magnitude! But what are the chances of seeing 'qx'? It's likely that we don't need to store any information about the context 'qx', since it's unlikely to occur. So we can eliminate 'qx'. In fact, Shoelacer eliminates all contexts, save for the top 512 most frequent contexts.
Why is this not optimal elimination? Because unknown contexts are treated as order-0. If the distribution of symbols after an order-2 context is identical to the order-0 context, storing this context gains us nothing, and we've duplicated the order-0 information in memory. Actually, Shoelacer mitigates the latter problem, but the context is still explicitly stored needlessly.
Heuristic #2: Quantize Context Distributions
Recall that each context has an associated probability distribution which accounts for each of 256 values a byte could represent. Even if we used only a bit per entry, 16 kilobytes (256*512/(8*1024)) of memory would be required. So although a compact representation of the probability distributions is key, it alone is insufficient.
Instead, I borrow from the domain of lossy compression to compress the probability distributions. Treating the distributions of the top 512 contexts, Shoelacer performs vector quantization via the LBG algorithm using a Euclidean metric. Put another way, many of the probability distributions are very similar. One tool to reduce a larger set of vectors (one possible representation of the distributions) into a smaller set is vector quantization. The number of points quantized to is a power of two between 1 and 128 inclusive.
A couple of interesting aspects arise. First, the algorithm must actually produce distributions. That is, the sum of all the entries must be 1. The LBG algorithm relies on iterative averaging of distributions, and the average of a collection of distributions is provably a distribution, so this requirement is met. Secondly, the distortion we wish to minimize is not a actually a metric! None-the-less, even when a pseudo-metric was used that was far more representative of actual entropy, the results were weaker than using the Euclidean metric. Also, through experimentation, a post-quantization step which takes entropy entirely into consideration and readjusts the quantization points was found to increase compression performance. However, a better quantization scheme likely exists.
Heuristic #3: Compactly Represent Quantization Points
Finally, we've a reasonable glimpse at achieving the goal. By using direct observation, I concluded many distributions had a few symbols with high probabilities, and other distributions were more spread. I chose the following scheme for storing distributions:
|Byte #0||Distribution scheme|
|Byte #1||Number of symbols accounted for|
|Byte #2||Index of the unknown symbol|
|Data||A byte per symbol with the value of that symbol|
The 'unknown' symbol is an escape for any symbol which is not accounted for in this representation of a distribution. The data table is sorted by frequency of symbol. The 'distribution scheme' is an indicator of how to interpret the data table.
Initially, the various 'distribution schemes' supported were Elias gamma and Rice coding. After experimentation, Elias gamma coding was found to be mostly unused, and thus now this byte serves to represent the length of the Rice coding divisor. (I call it the 'base', but this is confusing due to the generic nature of Golomb-Rice coding, and its actual use in literature.)
Indexing the Contexts
When encoding or decoding, we must associate stored order-2 contexts with their compacted distribution, and if not-existent, use the order-0 compacted distribution. To do so, we require an indexing structure of some kind. We have the advantage in that we're doing this offline -- we can take as much time as we need to construct it -- but we'd like it to be very compact and fast.
Perfect hashing lends itself well to these circumstances. Bob Jenkins has implemented a perfect hash generation algorithm which generates hashes in the style of a^tab[b] where (a,b) is a unique pair of hashes of the index. In our case, where p is the previous character, and c is the current character, a=p; b=p+c; are the "hashes". They seem sufficient for a wide variety of data, despite their simplistic and mocking nature compared to what a canonical hash function should be.
The table consists of 256 short (2-byte) integers. The result of running the hash algorithm is this table which maps each pair (a,b) to a unique number between 0 and 511 inclusive. These are used as the indices to a table containing an entry per each stored context. The entries are one byte long. The lower bits of the byte are used to store which distribution is used. The high bits are used as a checksum (the low bits of 'b') to reduce the number of false positives matched to the entry. Consequently, as the number of distributions increases, so does the number of false positives. False positives decrease compression performance, while the more distributions increases it.
Finally, the distributions themselves have a table of indices (2-byte integer entries, one per distribution) to where, from the model data start, each compact representation begins.
Models which require 2*256 + 2*D + C + X bytes, where X is a tunable parameter for how much space we allow the compact representations to consume. (The representations are trimmed one byte at a time in a greedy but sub-optimal manner until the total size X is achieved.) Typical models come out to two and an eighth kilobytes.
Compression performance ranges from 3.4 to 4.6 bits per byte on data including executables, books, and databases that I've tested in varying degrees.
Shoelacer is a best guess of a heuristic built on a suboptimal model which uses the wrong algorithm to generate a recklessly small model used to compress potentially meaningless data. I can guarantee there -is- a way to optimize this algorithm. From its context elimination to its context indexing, there are optimizations to be made. But, for a casual 8-week project, the result works unexpectedly well.
Hosting for this project is provided by: