Hi.
I've implemented arithmetic coding together with finite state and context models. Currently it works with rational arithmetic (S*L*O*W) and is still buggy with floating point arithmetic. I'm taking a rest from it, since I've spent about a year on it, getting about 3+k bytes/sec with floating point arithmetic (still buggy, but fixing that won't change the complexity). No integer arithmetic or model blending yet. Anyway, I found a lot of material in a book titled Text Compression that stresses the following model for data compression:
1. A coder that knows how to encode and decode the results of
2. A probabilistic model that tells how probable each of the possible events is.
This model has turned out to be an EXCELLENT idea, although it was presented in terms of the C language. It's a model the codecs in Squeak don't follow, I guess for execution speed reasons. Nevertheless, I'd like to tell briefly how splitting the model from the coder turns out to be a good idea.
All compression schemes have two main components: the model and the coder. The coder has to write a stream of symbols based on what the model predicts about events that actually happen.
With this vague idea in mind, let's see what happens. There's a finite set of states that can occur at any given time: one value of a byte, more than 110 V on the electrical line, whatever. It's very important for this that the number of states is finite and determined before we start. A sequence of these states, happening one at a time, will be our source, while the list of states will be our alphabet. The whole idea of compression is to describe the source, in terms of the alphabet, as a sequence shorter than the source, which we'll call the destination. We're assuming for the moment that the input and output alphabets are the same (or at least somehow isomorphic).
Roughly speaking, ad hoc techniques are not a good idea when compression is to be achieved in the general case, because they only work in a very specific environment. Adaptive schemes are a must. The whole idea is that there's a model that is fed all the events that have already happened and, based on those, predicts the occurrence of each symbol (state) of the alphabet with a certain probability. How the model comes up with its predictions is not important; the point is that we're fed with predictions.
These predictions (in particular, the probability of the symbol that actually happened) are then sent to the coder, which somehow writes enough information about what happened to be able to decode it later. How this is done is the coder's business, and it doesn't care where the predictions come from.
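The split described above can be sketched in a few lines. This is a toy sketch (all names are mine, not from the book): an order-0 adaptive model counts symbols, and a "coder" that merely tallies the ideal code length, -log2 p, stands in for a real bit writer. The driver loop is the point: neither side knows anything about the other.

```python
import math

class CountModel:
    """Order-0 adaptive model: predicts from counts of past events."""
    def __init__(self, alphabet):
        # Start every count at 1 so no symbol ever gets probability zero.
        self.counts = {s: 1 for s in alphabet}
        self.total = len(alphabet)

    def distribution(self):
        return {s: c / self.total for s, c in self.counts.items()}

    def update(self, symbol):
        # Feed the model the event that actually happened.
        self.counts[symbol] += 1
        self.total += 1

class BitCounter:
    """A stand-in 'coder' that only adds up the ideal code length."""
    def __init__(self):
        self.bits = 0.0

    def encode(self, symbol, dist):
        self.bits += -math.log2(dist[symbol])

def compress(source, model, coder):
    # The driver: the model predicts, the coder writes; neither sees the other.
    for symbol in source:
        coder.encode(symbol, model.distribution())
        model.update(symbol)

model = CountModel("ab")
coder = BitCounter()
compress("aaaa", model, coder)
# coder.bits is now below 4: the model adapted to the skewed source.
```

Swapping in a context model or a real arithmetic coder only means replacing one of the two classes; `compress` stays untouched.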
Examples! Examples!
Huffman coding, roughly speaking. The probabilities are given by the frequencies of the symbols, which update a frequency tree. The coder chooses a certain bit sequence for each symbol that actually happens. This example is quite illustrative, since Huffman coding is usually thought of as a single monolithic process.
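To make that concrete, here's a sketch of the static variant (the adaptive, tree-updating one the paragraph alludes to is more involved). The model part is the frequency count; the coder part is the prefix-code assignment built from it.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman prefix code from symbol frequencies.
    Static sketch; assumes at least two distinct symbols."""
    freq = Counter(text)  # the "model": frequencies of past symbols
    # Heap entries: (frequency, tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing '0' and '1'.
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
# The most frequent symbol, 'a', gets the shortest bit sequence.
```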
LZ, token replacement in general. The model says where in the past the sequences have happened, and the coder writes tokens referring to those previous repeated sequences.
LZ, dictionary based in general. This amounts to alphabet translation. The model keeps the new alphabet, and the coder writes symbols in the new alphabet instead.
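A small LZW-style encoder (my own sketch, one member of this family) shows the split: the dictionary is the model's growing alphabet, and the coder's output is just indices into it.

```python
def lzw_encode(text):
    """Dictionary-based coding sketch (LZW-style). The model grows
    a dictionary (the new alphabet); the coder emits indices into it."""
    dictionary = {chr(i): i for i in range(256)}  # initial alphabet
    current, output = "", []
    for ch in text:
        if current + ch in dictionary:
            current += ch  # keep extending the current match
        else:
            output.append(dictionary[current])      # coder: emit an index
            dictionary[current + ch] = len(dictionary)  # model: new symbol
            current = ch
    if current:
        output.append(dictionary[current])
    return output

lzw_encode("ababab")  # -> [97, 98, 256, 256]: repeats become dictionary hits
```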
A probabilistic model driving an arithmetic coder. The coder writes bits corresponding to the probabilities supplied by the model.
The beauty of all this is that, when done properly, one just implements coders and then uses them with various models, without either side knowing what's going on. Any comments?
Andres.
squeak-dev@lists.squeakfoundation.org