[Math] Information Theory – Shannon’s “Self-Information” units

Shannon's "self-information" of a specific outcome $A$ is given by
$-\log(\Pr(A))$, and the entropy is the expectation of the self-information over all the outcomes of the random variable.

When the base of the log is 2, the units of information/entropy are called "bits".
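For a concrete feel (a minimal Python sketch, not part of the original question; the fair-coin distribution is just an illustrative choice), self-information and entropy in bits follow directly from the definition above:

```python
from math import log2

def self_information(p):
    """Self-information, in bits, of an outcome with probability p (base-2 log)."""
    return -log2(p)

def entropy(probs):
    """Entropy in bits: the expectation of the self-information over all outcomes."""
    return sum(p * self_information(p) for p in probs if p > 0)

# A fair coin: each outcome has probability 1/2
print(self_information(0.5))   # 1.0 -> one bit per outcome
print(entropy([0.5, 0.5]))     # 1.0 -> one bit of entropy
```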

What is the best explanation for the following simple question:

Why are these information units called "bits"?

Best Answer

One good reason to call them bits is that this is the number of bits that you need on average to encode an outcome. Some Wikipedia articles you might want to take a look at are Huffman coding, arithmetic coding, entropy encoding and Shannon's source coding theorem.

To give a simple example, say outcome A has probability $1/2$ and outcomes B and C have probabilities $1/4$ each. Then you can encode A by $0$, B by $10$ and C by $11$. This is an optimal prefix-free code; the expected number of bits required to encode an outcome is $\frac12\cdot1+\frac14\cdot2+\frac14\cdot2=\frac32$, and since the number of bits in each codeword is the self-information (to base $2$) of the outcome it encodes, this expected number of bits is the entropy of the distribution.
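To check this numerically (a small Python sketch; the `heapq`-based Huffman construction below is my own illustration, prompted by the Huffman coding article mentioned above, not code from the original answer), one can build an optimal prefix-free code for this distribution and compare its expected length with the entropy:

```python
import heapq
from math import log2

def huffman_code(probs):
    """Build an optimal prefix-free (Huffman) code for a {symbol: probability} dict."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least-probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        # Prepend '0' to one subtree's codewords and '1' to the other's, then merge
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"A": 0.5, "B": 0.25, "C": 0.25}
code = huffman_code(probs)                              # e.g. {'A': '0', 'B': '10', 'C': '11'}
avg_len = sum(probs[s] * len(w) for s, w in code.items())
H = -sum(p * log2(p) for p in probs.values())
print(code, avg_len, H)                                 # average length 1.5 bits == entropy 1.5 bits
```

For this distribution the average codeword length equals the entropy exactly because every probability is a power of $1/2$; in general the Huffman average length lies within one bit of the entropy.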
