Why Shannon’s Entropy Uses Logarithms – Information Theory


Given a variable $X$ with states $\lbrace 1,2,\ldots,n \rbrace$ and probabilities $\lbrace p_1,p_2,\ldots,p_n \rbrace$, we can associate with the probability distribution an entropy defined as $$ H(\lbrace p_i \rbrace)= -\sum_{i=1}^n p_i \log p_i = \left\langle \log{\frac{1}{p}} \right\rangle $$ where $\langle\cdot\rangle$ denotes the average over the probabilities $\lbrace p_i \rbrace$.
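As a quick numerical check (a small Python sketch of my own; the function name is just for illustration, and I use $\log_2$ so that the entropy comes out in bits):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum_i p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over 8 states: H = log2(8) = 3 bits
print(entropy([1/8] * 8))                   # 3.0
# A skewed distribution carries fewer bits per outcome
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75
```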

In terms of information, I can relate this to the number of questions I need to ask in order to completely specify the configuration the variable is in (which is the same as the number of bits needed to store the configuration).

Why do the bits go with $\log$?

For a uniform probability distribution the probability of each state is $p_i=1/n$ (with $1/p_i$ being the number of states), and the number of questions I need to ask can be estimated visually by drawing the sample space:

The biggest square is the sample space of area 1. Each division in half represents a binary question.

Now I can first divide the sample space $S$ into two regions and ask: is it contained in the first region or the second? If it is in the second, I ask: is it contained in the new first half or the second half? And so on, until the regions I'm dividing into are roughly the same size as my probability. The number of questions $m$ is approximately (it is drawn as exact, but $p_i$ needn't be exactly $1/16$ of the space as in the figure):
$$ \frac{1}{2^m} \sim p_i \rightarrow m \sim \log \frac{1}{p_i} $$
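To check this counting numerically, here is a small sketch (the helper name is mine) that counts the halving questions for a uniform distribution over $n$ states and compares them with $\log_2(1/p_i)$:

```python
import math

def halving_questions(n):
    """Number of 'which half?' questions needed to single out one of
    n equally likely states: keep halving the sample space until each
    region holds roughly one state, i.e. ceil(log2(n))."""
    m, regions = 0, 1
    while regions < n:
        regions *= 2
        m += 1
    return m

for n in [2, 4, 16, 100]:
    p = 1 / n
    print(n, halving_questions(n), math.log2(1 / p))
# for n = 16: 4 questions, exactly log2(16); for n = 100 the match is only approximate
```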
The problem is that this is justified only for uniform distributions.

Now, my question is this: the above is not the most efficient way to extract information for non-uniform distributions. The best way, given an ordering of my probabilities from largest to smallest, $p_{i_1}\ge p_{i_2}\ge\ldots \ge p_{i_n}$, is to ask: "is it the most probable configuration, the one with probability $p_{i_1}$?" If it is not, "is it the second most probable configuration, the one with probability $p_{i_2}$?" And so on. But I have no idea how to show that the entropy goes with $\log \frac{1}{p_i}$ given this more efficient way of getting information.

Thank you for your time.

Best Answer

Note that the problem of finding the most efficient series of questions with "yes-no" answers in order to determine the outcome of $X$ is equivalent to what is known in information theory as the source coding problem, i.e., assign to the $k$-th outcome a binary codeword (number) of length $L_k$ such that the codewords are distinguishable and have an average length $\bar{L}\triangleq \sum_k p_k L_k$ as small as possible.

The method you suggest for finding the outcome of a non-equiprobable $X$ is essentially assigning the following codewords to each outcome (ordered in decreasing probability): $$ \begin{align} 1&\text{ }(L_1=1)\\ 01&\text{ }(L_2=2)\\ 001&\text{ }(L_3=3)\\ \vdots\\ 00\ldots01 & \text{ }(L_n=n) \end{align} $$
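To see concretely how this scheme performs, here is a small Python sketch (illustrative names of my own, not a prescribed implementation) that computes its average codeword length and compares it with the entropy:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def avg_length_sequential(probs):
    """Average codeword length when the k-th most probable outcome gets
    the codeword 0...01 of length k ('is it the k-th most probable?')."""
    ordered = sorted(probs, reverse=True)
    return sum((k + 1) * p for k, p in enumerate(ordered))

for probs in ([0.25] * 4,                              # uniform
              [0.5, 0.25, 0.125, 0.0625, 0.0625]):     # heavily skewed
    print(entropy(probs), avg_length_sequential(probs))
# uniform: H = 2 bits, but this scheme needs 2.5 questions on average
# skewed:  H = 1.875 bits, this scheme gives 1.9375
```

Even for the skewed distribution, where asking in order of probability looks natural, the scheme stays above the entropy; for the uniform case the gap is much larger.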

However, this is *not* the best we can do. This is because the most efficient question at each stage of the "game" is the one whose two outcomes are as close as possible to equiprobable, which, in general, is not accomplished by your approach. The optimal series of questions should be as follows: For the first question, partition the sample space in two sets (events) whose probabilities are as close to $0.5$ as possible and ask whether $X$ belongs to one of these sets. Once you know the set $X$ is in, repeat the procedure for the second question, this time considering only the "reduced" sample space corresponding to the known set, and so on.
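As an illustration of this principle, here is a rough sketch (names of my own choosing) of a greedy near-equiprobable splitting in the style of Shannon-Fano coding, which follows the idea above but is not always optimal:

```python
def balanced_split_lengths(probs):
    """Assign question counts (codeword lengths) by recursively splitting the
    outcomes into two groups whose total probabilities are as close to equal
    as possible -- the 'near-equiprobable question' principle."""
    items = sorted(enumerate(probs), key=lambda x: -x[1])

    def split(group, depth, lengths):
        if len(group) == 1:
            idx, _ = group[0]
            lengths[idx] = depth
            return
        total = sum(p for _, p in group)
        running, cut, best_gap = 0.0, 1, float('inf')
        for i in range(1, len(group)):
            running += group[i - 1][1]
            gap = abs(2 * running - total)  # |first part - second part|
            if gap < best_gap:
                best_gap, cut = gap, i
        split(group[:cut], depth + 1, lengths)
        split(group[cut:], depth + 1, lengths)

    lengths = [0] * len(probs)
    split(items, 0, lengths)
    return lengths

probs = [0.4, 0.3, 0.15, 0.1, 0.05]
lengths = balanced_split_lengths(probs)
print(lengths)                                     # [1, 2, 3, 4, 4]
print(sum(l * p for l, p in zip(lengths, probs)))  # 2.05 questions on average
```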

One well-known source coding algorithm that follows this principle is Huffman source coding. It can be shown that for the average Huffman codeword length it holds that $\bar{L} \in [H(\{p_i\}), H(\{p_{i}\})+1)$, i.e., the average number of binary questions required to determine the outcome of $X$ is close to the entropy of $X$.
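A minimal sketch of Huffman's procedure (the helper names are mine): compute the codeword lengths by repeatedly merging the two least probable nodes, then compare the resulting average length with the entropy:

```python
import heapq
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Codeword lengths from Huffman's algorithm: repeatedly merge the two
    least probable nodes; each merge adds one bit to every codeword
    contained in the merged nodes."""
    # heap entries: (probability, tie-breaking counter, list of outcome indices)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, ids1 = heapq.heappop(heap)
        p2, _, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, ids1 + ids2))
        counter += 1
    return lengths

probs = [0.4, 0.3, 0.15, 0.1, 0.05]
L = huffman_lengths(probs)
avg = sum(l * p for l, p in zip(L, probs))
print(L, avg, entropy(probs))   # [1, 2, 3, 4, 4]  2.05  ~2.01
# the average length lands in [H, H+1); when all probabilities are powers
# of 1/2 it equals H exactly
```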
