[Math] Relationship between compression, Shannon entropy and Kolmogorov complexity

entropy, kolmogorov-complexity

I have read that Shannon entropy is used as a bound on the compressibility of a message; for example, [1] says: "In other words, the best possible lossless compression rate is the entropy rate."

This makes a lot of sense if we look at a message m produced by a random source, where each bit m[i] is either 1 or 0 with probability p = 0.5. Then we get a Shannon entropy of 1 bit per symbol, and the message is not compressible.
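A small sketch of this claim, using the plug-in (empirical) entropy estimate over single-bit symbols; the helper name `empirical_entropy` is just for illustration:

```python
import random
from collections import Counter
from math import log2

def empirical_entropy(symbols):
    """Plug-in estimate of Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

# Message from a memoryless source with p(0) = p(1) = 0.5.
m = [random.randint(0, 1) for _ in range(100_000)]
print(empirical_entropy(m))  # ~1.0 bit per symbol: no room for lossless compression
```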

Now let us look at a message m' where m'[i] = i mod 2. For a long message we get the same probability distribution of symbols as in the previous example if we consider each bit as its own symbol. However, if we consider two adjacent bits as one symbol, then the message is just 01 01 01 repeating, so the Shannon entropy is 0. So my first question is: how do we choose what constitutes a symbol when calculating Shannon entropy?
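To make the symbol-choice question concrete, here is the same plug-in estimate applied to m' with block sizes 1 and 2 (a self-contained sketch; `blocks` is a hypothetical helper that chops the message into non-overlapping blocks):

```python
from collections import Counter
from math import log2

def empirical_entropy(symbols):
    counts = Counter(symbols)
    n = len(symbols)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def blocks(seq, k):
    """Chop seq into non-overlapping blocks of k bits; each block is one symbol."""
    return [tuple(seq[i:i + k]) for i in range(0, len(seq) - k + 1, k)]

m_prime = [i % 2 for i in range(100_000)]           # 0 1 0 1 0 1 ...
print(empirical_entropy(blocks(m_prime, 1)))         # 1.0 bit per 1-bit symbol
print(empirical_entropy(blocks(m_prime, 2)) / 2)     # 0.0: the only length-2 block is (0, 1)
```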

A related concept seems to be Kolmogorov complexity: the Kolmogorov complexity of a string m is the number of bits needed for the shortest program that produces m when executed. How are the two related? For example, if we build a Fibonacci string s{x} such that s{0} = 0, s{1} = 1, s{n} = s{n-2}s{n-1}, it is clear that the Kolmogorov complexity is very small (we just write a program that applies the above recurrence in a loop x times). Therefore we can send this program and the integer x as our "compressed data". However, since [1] claims that Shannon entropy is a lower bound on the compressed data size, it should be equally small. But in this case how do we choose our symbols so that the Shannon entropy gets close to zero?
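A minimal sketch of that idea: the fixed program below plus the integer x is a complete description of s{x}, so the description length grows only like log2(x) while the string itself grows exponentially (the function name `fib_string` is just for illustration):

```python
def fib_string(x):
    """Return the Fibonacci string s_x with s_0 = "0", s_1 = "1", s_n = s_{n-2} + s_{n-1}."""
    if x == 0:
        return "0"
    prev, cur = "0", "1"              # s_0, s_1
    for _ in range(x - 1):
        prev, cur = cur, prev + cur
    return cur

x = 30
s = fib_string(x)
# The "compressed data" is this short, fixed program plus the integer x,
# while len(s) grows like the Fibonacci numbers.
print(x, len(s))                      # 30 1346269
```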

I would really appreciate it if someone could provide some intuition on this.

My thought on this is that sources such as [1] may be a bit misleading, because Shannon's bound assumes we already have the optimal probability model, which in general is not computable, just as Kolmogorov complexity is not computable in general. Therefore, if we use a simple method to estimate a probability distribution, e.g. by assuming 1 symbol = 1 bit, we get an approximation of the Shannon entropy, not the exact value.
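To illustrate the gap between a naive per-symbol frequency estimate and what a structure-aware coder achieves, here is a sketch using Python's standard zlib module; note that a real compressor only gives a computable upper bound on the description length, not the true entropy rate or Kolmogorov complexity:

```python
import os
import zlib

random_data = os.urandom(100_000)     # from an (approximately) memoryless uniform source
structured = b"01" * 50_000           # the 0 1 0 1 ... message, one ASCII byte per bit

# Per-byte symbol frequencies alone suggest ~12.5 kB for `structured`
# (two equiprobable byte values = 1 bit per byte), but the compressor,
# which models repetition, does far better; the random data stays ~100 kB.
print(len(zlib.compress(random_data, 9)))   # slightly above 100_000
print(len(zlib.compress(structured, 9)))    # at most a few hundred bytes
```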

Regards,
Timo

Best Answer

The actual choice of symbols does not affect Shannon entropy. In the example sequence 0 1 0 1 0 1 0 1 0 ..., what gives zero Shannon entropy is not a particular choice of symbols but the very rigid structure: there are basically only two different states (corresponding to the parity of the index within the sequence). This stays the same if you increase the block size (considering 01 as a symbol, or taking 010 and 101 as symbols of length 3, etc.). What entropy measures is essentially the exponential growth rate of the number of such words (states) as the block length increases. If the number of such words does not grow with increasing length, the entropy is zero.

For a random sequence you would expect to see all (or most) symbol sequences of a fixed length. Over a binary alphabet, the number of those blocks would be $2^n$, where $n$ is the length. This has an exponential growth rate of $2$, which, taking the logarithm in base 2, yields the entropy of one bit per symbol.
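This growth-rate picture can be checked numerically: if $N(n)$ denotes the number of distinct length-$n$ blocks occurring in the sequence, the quantity described above is $\frac{1}{n}\log_2 N(n)$. A sketch (the helper name `distinct_blocks` is just for illustration):

```python
import random
from math import log2

def distinct_blocks(seq, n):
    """Number of distinct length-n blocks (overlapping windows) occurring in seq."""
    return len({tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)})

alternating = [i % 2 for i in range(1_000_000)]                  # 0 1 0 1 ...
random_seq = [random.randint(0, 1) for _ in range(1_000_000)]    # fair coin flips

for n in (1, 2, 4, 8, 12):
    n_alt = distinct_blocks(alternating, n)      # stays at 2: the two "states"
    n_rnd = distinct_blocks(random_seq, n)       # close to 2**n
    print(n, log2(n_alt) / n, log2(n_rnd) / n)   # tends to 0 for alternating, stays ~1 for random
```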
