Hi.
I've implemented arithmetic coding together with finite state and context models. Currently it works with rational arithmetic (S*L*O*W) and is still buggy with floating point arithmetic. I'm taking a rest from it, since I've spent about a year on it, getting about 3+k bytes/sec with floating point arithmetic (still buggy, but fixing that won't change the complexity). No integer arithmetic or model blending yet. Anyway, I found a lot of material in a book titled Text Compression that stresses the following model for data compression:
1. A coder that knows how to encode and decode the results of
2. A probabilistic model that tells how probable each of the possible events is.
This model has turned out to be an EXCELLENT idea, although it was presented in terms of the C language. It's a model the codecs in Squeak don't follow, I guess for execution speed reasons. Nevertheless, I'd like to tell briefly how splitting the model from the coder turns out to be a good idea.
All compression schemes have two main components: the model and the coder. The coder has to write a stream of symbols based on what the model predicts about events that actually happen.
With this vague idea in mind, let's see what happens. There's a finite set of states that can occur at any given time: one value of a byte, more than 110 V on the electrical line, whatever. It's very important for this that the number of states is finite and determined before we start. A sequence of these states, happening one at a time, will be our source, while the list of states will be our alphabet. The whole idea of compression is to describe the source, in terms of the alphabet, as a sequence shorter than the source, which we'll call the destination. We're assuming for the moment that the input and output alphabets are the same (or at least somehow isomorphic).
Roughly speaking, ad hoc techniques are not a good idea when compression is to be achieved in the general case, because they only work in a very specific environment. Adaptive schemes are a must. The whole idea is that there's a model that is fed all the events that have already happened and, based on those, predicts the occurrence of each symbol (state) of the alphabet with a certain probability. How the model comes up with its predictions is not important; the point is that we're fed with predictions.
These predictions (in particular, the probability of the symbol that actually happened) are then sent to the coder, which somehow writes enough information about what happened to be able to decode it later. How this is done is the coder's business, and it doesn't care where the predictions come from.
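The split described above can be sketched in a few lines. This is a toy sketch (all names are mine, not from the book): an order-0 adaptive model counts symbols, and a "coder" that merely tallies the ideal code length, -log2 p, stands in for a real bit writer. The driver loop is the point: neither side knows anything about the other.

```python
import math

class CountModel:
    """Order-0 adaptive model: predicts from counts of past events."""
    def __init__(self, alphabet):
        # Start every count at 1 so no symbol ever gets probability zero.
        self.counts = {s: 1 for s in alphabet}
        self.total = len(alphabet)

    def distribution(self):
        return {s: c / self.total for s, c in self.counts.items()}

    def update(self, symbol):
        # Feed the model the event that actually happened.
        self.counts[symbol] += 1
        self.total += 1

class BitCounter:
    """A stand-in 'coder' that only adds up the ideal code length."""
    def __init__(self):
        self.bits = 0.0

    def encode(self, symbol, dist):
        self.bits += -math.log2(dist[symbol])

def compress(source, model, coder):
    # The driver: the model predicts, the coder writes; neither sees the other.
    for symbol in source:
        coder.encode(symbol, model.distribution())
        model.update(symbol)

model = CountModel("ab")
coder = BitCounter()
compress("aaaa", model, coder)
# coder.bits is now below 4: the model adapted to the skewed source.
```

Swapping in a context model or a real arithmetic coder only means replacing one of the two classes; `compress` stays untouched.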
Examples! Examples!
Huffman coding, roughly speaking. The probabilities are given by the frequencies of the symbols, which update a frequency tree. The coder chooses a certain bit sequence for each symbol that actually happens. This example is quite illustrative, since Huffman coding is usually thought of as a single monolithic process.
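To make that concrete, here's a sketch of the static variant (the adaptive, tree-updating one the paragraph alludes to is more involved). The model part is the frequency count; the coder part is the prefix-code assignment built from it.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman prefix code from symbol frequencies.
    Static sketch; assumes at least two distinct symbols."""
    freq = Counter(text)  # the "model": frequencies of past symbols
    # Heap entries: (frequency, tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing '0' and '1'.
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
# The most frequent symbol, 'a', gets the shortest bit sequence.
```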
LZ, token replacement in general. The model says where in the past the sequences have happened, and the coder writes tokens referring to those previous repeated sequences.
LZ, dictionary based in general. This amounts to alphabet translation. The model keeps the new alphabet, and the coder writes symbols in the new alphabet instead.
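A small LZW-style encoder (my own sketch, one member of this family) shows the split: the dictionary is the model's growing alphabet, and the coder's output is just indices into it.

```python
def lzw_encode(text):
    """Dictionary-based coding sketch (LZW-style). The model grows
    a dictionary (the new alphabet); the coder emits indices into it."""
    dictionary = {chr(i): i for i in range(256)}  # initial alphabet
    current, output = "", []
    for ch in text:
        if current + ch in dictionary:
            current += ch  # keep extending the current match
        else:
            output.append(dictionary[current])      # coder: emit an index
            dictionary[current + ch] = len(dictionary)  # model: new symbol
            current = ch
    if current:
        output.append(dictionary[current])
    return output

lzw_encode("ababab")  # -> [97, 98, 256, 256]: repeats become dictionary hits
```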
A probabilistic model driving an arithmetic coder. The coder writes bits corresponding to the probabilities supplied by the model.
The beauty of all this is that, when done properly, one just implements coders and then uses them with various models, without either side knowing what's going on. Any comments?
Andres.
squeak-dev@lists.squeakfoundation.org