Interpretation of Shannon Entropy formula as expected value

entropy, information theory

For a random variable $X$ with potential outcomes $\{x_1,x_2,\dots,x_n\}$ and probabilities $p(x_1),p(x_2),\dots,p(x_n)$, the formula for the Shannon entropy is $H(X) = -\sum_{x \in X}{p(x) \log_{2}(p(x))}$.

If I denote $b(x) = -\log_2(p(x))$, then the formula becomes:

$$H(X) = \sum_{x \in X}{p(x)\,(-\log_{2}(p(x)))} = \sum_{x \in X}{p(x)\, b(x)}$$

Now, Cover & Thomas write: "The entropy is a measure of the average uncertainty in the random variable. It is the number of bits on average required to describe the random variable." We also know that the expected value of a function $f$ of a random variable $X$ is $E_X[f(x)] = \sum_{x \in X}{(p(x)f(x))}$.

Thus, the formula for entropy can be rewritten as the expected value of $b(X)$, that is:

$$H(X) = E_X[b(x)] = \sum_{x \in X}{(p(x)b(x))}$$
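A quick numeric sanity check (a minimal sketch; the four-outcome pmf below is only an illustrative assumption) shows that the two ways of computing the same quantity agree:

```python
import math

# Illustrative pmf (an assumption): four outcomes with dyadic probabilities.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

def b(x):
    """b(x) = -log2 p(x), the quantity being averaged in the entropy formula."""
    return -math.log2(p[x])

# Entropy computed directly from the definition ...
H_direct = -sum(px * math.log2(px) for px in p.values())
# ... and computed as the expected value E_X[b(X)].
H_as_expectation = sum(p[x] * b(x) for x in p)

print(H_direct, H_as_expectation)  # both equal 1.75 bits for this pmf
```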

Questions:

  1. Is it right to think of $H(X)$ as $E_X[b(x)]$ or is this just an algebraic coincidence?
  2. If 1. is correct, then is it also correct that $b(X=x)$ is the number of bits required to describe the potential outcome $x$ of $X$?
  3. If 2. is correct, why do potential outcomes with $p(x)=1$ require 0 bits, $p(x)=0.5$ require 1 bit, etc.? Essentially, why use $b(x)$ and not another function?
  4. Is it a correct interpretation to state that potential outcomes with $p(x)=0$ require an infinite amount of bits?
  5. Is there a specific name for the function $b(x) = -\log_2(p(x))$?

Note: I've already researched previous questions on Shannon entropy, but none goes directly to the issue of interpreting $H(X)$ as $E_X[b(x)]$.

Best Answer

  1. Yes, it's the most important justification for defining entropy in the first place.

  2. Yes, but only within a systematic coding scheme. You could use just $1$ bit to transmit any piece of information, including combined events. For example, if I send you a $0$ tomorrow, that could mean that bitcoin values drop by 50% and WW3 breaks out. Unfortunately, since such rich information has very low probability, it would be a total waste of a valuable single bit to encode it, because all the other, more likely messages would then have to be sent with longer codes. That's why taking the average is important for knowing how many bits are actually needed for a single outcome (which, by the way, could be further subdivided in reality).

  3. Because if $p(x)=1$, we both know it happens for sure (at least almost surely), and hence there is no need to transmit this information. If $p(x)=0.5$, like flipping a fair coin, then it optimally takes 1 bit to transmit, but this needs further clarification: after all, the expected value depends on the systematic way the coding is designed, not on a single event.

  4. Yes: if something never happens, you would not expect to transmit that message in the first place, so you would not include it in your design of codes. (Note that this only applies to discrete probability distributions that have finitely many outcomes, not, say, a Gaussian distribution.)

  5. It is usually called the self-information (or surprisal) of the outcome $x$, but what you really want is not a name so much as a justification. Let's do it now.

We want to find the function $b$ such that $E_X[b] = \sum_x p(x)b(x)$ is minimized. The important thing here is to figure out the constraint. Of course $b(x)\ge 0$ (and $b(x)$ also has to be an integer, but we'll ignore that for now). We also need the encoding to be unambiguous, and this turns out to be equivalent to the so-called Kraft–McMillan inequality: $\sum_x 2^{-b(x)}\le 1$. (For instance, the prefix-free code $\{0, 10, 110, 111\}$ has $2^{-1}+2^{-2}+2^{-3}+2^{-3}=1$.) The proof is not difficult and we shall skip it. Let $c(x) = 2^{-b(x)}$; then we have the optimization problem $\min_{b} E_X[b] = -\sum_x p(x) \log_2 c(x)$, which is equivalent to $$\max_c \sum_x p(x)\log_2 c(x)$$ subject to the constraint $$\forall x,\; c(x)\ge 0, \text{ and }\sum_x c(x)= 1.$$

Here we used $\sum_x c(x)=1$ instead of $\le 1$; this loses nothing, because the objective is increasing in each $c(x)$, so the maximum is attained when the constraint is tight.

The solution can be found with a Lagrange multiplier, and it is exactly $c(x)=p(x)$ for all $x$, i.e. $b(x)=-\log_2 p(x)$. This shows that $H(X)$ as defined is a lower bound for the average coding length. Because $b(x)$ has to be an integer, this lower bound may not be achieved exactly, and we might need up to one extra bit on average, which can always be realized by the Huffman code.
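For completeness, here is a sketch of that Lagrange-multiplier step (ignoring the constraint $c(x)\ge 0$, which turns out to be inactive at the optimum):

$$\mathcal{L}(c,\lambda) = \sum_x p(x)\log_2 c(x) + \lambda\Big(1-\sum_x c(x)\Big), \qquad \frac{\partial \mathcal{L}}{\partial c(x)} = \frac{p(x)}{c(x)\ln 2} - \lambda = 0 \;\Longrightarrow\; c(x) = \frac{p(x)}{\lambda \ln 2}.$$

Summing over $x$ and using $\sum_x c(x)=1$ and $\sum_x p(x)=1$ gives $\lambda = 1/\ln 2$, hence $c(x)=p(x)$ and $b(x) = -\log_2 p(x)$.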

On the other hand, if we are allowed to encode blocks of $n$ iid outcomes together (which is the case when compressing a fixed text, as opposed to real-time long-distance communication), the at-most-one extra bit is shared among the $n$ symbols, so the average length per symbol lies between $H(X)$ and $H(X)+1/n$ and can be brought as close to $H(X)$ as we want. This is essentially Shannon's source coding theorem.
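To see this at work numerically, here is a minimal sketch (the biased-coin source `p`, the block sizes, and the helper `huffman_lengths` are all assumptions made for this illustration): it builds a Huffman code for blocks of $n$ iid outcomes and prints the average number of bits per symbol.

```python
import heapq
import itertools
import math

def huffman_lengths(pmf):
    """Return {symbol: codeword length} of a binary Huffman code for the pmf."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(prob, next(counter), {sym: 0}) for sym, prob in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, lens1 = heapq.heappop(heap)
        p2, _, lens2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every codeword beneath them.
        merged = {s: n + 1 for s, n in {**lens1, **lens2}.items()}
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

# Illustrative source (an assumption): a biased coin with H(X) ~ 0.881 bits.
p = {"H": 0.3, "T": 0.7}
H = -sum(q * math.log2(q) for q in p.values())

for n in (1, 2, 4, 8):
    # pmf of blocks of n iid outcomes of X
    block_pmf = {"".join(block): math.prod(p[s] for s in block)
                 for block in itertools.product(p, repeat=n)}
    lengths = huffman_lengths(block_pmf)
    avg_bits_per_symbol = sum(q * lengths[blk] for blk, q in block_pmf.items()) / n
    print(f"n={n}: {avg_bits_per_symbol:.4f} bits/symbol  (H(X) = {H:.4f})")
```

For $n=1$ the cost is a full bit per symbol; the theory guarantees that the printed value always lies between $H(X)$ and $H(X)+1/n$, so the gap shrinks as the block length grows.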

This is just one way to justify the definition of $H(X)$. There are other ways, e.g. using typical sequences; see Witten's exposition.
