How to Interpret Differential Entropy

entropy, information-theory

I recently read this article on the entropy of a discrete probability distribution. It describes a nice way of thinking about entropy as the expected number of bits (at least when using $\log_2$ in your entropy definition) needed to encode a message when your encoding is optimal, given the probability distribution of the words you use.
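To make that concrete, here is a minimal sketch in Python (the four-symbol distribution is just an example I picked): when each symbol $x$ is encoded with $-\log_2 p(x)$ bits, the expected message length per symbol equals the entropy.

```python
import numpy as np

# Example source with four symbols; the probabilities are chosen as powers of
# two so that -log2(p) is an integer number of bits for every symbol.
p = np.array([0.5, 0.25, 0.125, 0.125])

entropy = -np.sum(p * np.log2(p))           # discrete entropy in bits
code_lengths = -np.log2(p)                  # optimal code lengths: 1, 2, 3, 3 bits
expected_length = np.sum(p * code_lengths)  # expected bits per symbol

print(entropy, expected_length)             # both are 1.75
```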

However, when extending to the continuous case like here, I believe this way of thinking breaks down, since $\sum_x p(x) = \infty$ for any continuous probability distribution $p(x)$ (please correct me if that is wrong). So I was wondering: is there a nice way of thinking about what continuous entropy means, just as in the discrete case?

Best Answer

There is no interpretation of differential entropy that is as meaningful or useful as that of discrete entropy. The problem with continuous random variables is that each of their values typically has probability 0, and would therefore require an infinite number of bits to encode exactly.

If you look at the limit of discrete entropy, quantizing $x$ into intervals $[n\varepsilon, (n + 1)\varepsilon)$, each of probability roughly $p(x_n)\varepsilon$, you end up with

$$-\int p(x) \log_2 p(x) \, dx - \log_2 \varepsilon$$

and not the differential entropy. This quantity is in a sense more meaningful, but it diverges to infinity as we take smaller and smaller intervals. That makes sense: we need more and more bits to encode which of the many intervals the value of our random variable falls into.
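As a quick numerical check (a sketch assuming a standard normal for $p$, using SciPy): the entropy of the binned variable tracks $-\int p(x) \log_2 p(x) \, dx - \log_2 \varepsilon$ and grows without bound as $\varepsilon$ shrinks.

```python
import numpy as np
from scipy import stats

p = stats.norm()                               # example density: standard normal
h_diff = 0.5 * np.log2(2 * np.pi * np.e)       # its differential entropy in bits

for eps in [0.5, 0.1, 0.01]:
    edges = np.arange(-10, 10 + eps, eps)      # bins [n*eps, (n+1)*eps) covering almost all mass
    probs = np.diff(p.cdf(edges))              # probability of each bin
    probs = probs[probs > 0]
    h_binned = -np.sum(probs * np.log2(probs)) # discrete entropy of the binned variable
    print(eps, h_binned, h_diff - np.log2(eps))  # the two values agree and both diverge
```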

A more useful quantity to look at for continuous distributions is the relative entropy (also known as the Kullback-Leibler divergence). For discrete distributions it is defined as

$$D_\text{KL}[P \mid\mid Q] = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}.$$

It measures the number of extra bits used when the true distribution is $P$ but we encode $x$ with $-\log_2 Q(x)$ bits. Taking the same limit of the relative entropy, we arrive at

$$D_\text{KL}[p \mid\mid q] = \int p(x) \log_2 \frac{p(x)}{q(x)} \, dx,$$

because the $\log_2 \varepsilon$ terms cancel. For continuous distributions this corresponds to the number of extra bits used in the limit of infinitesimally small bins. For both continuous and discrete distributions, it is always non-negative.
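Here is a sketch of the same binning argument for relative entropy, assuming two arbitrary example densities, $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(1, 2^2)$: the discrete KL divergence between the binned variables does not diverge but settles at the continuous integral, since the $\log_2 \varepsilon$ terms cancel.

```python
import numpy as np
from scipy import stats

p, q = stats.norm(0, 1), stats.norm(1, 2)       # arbitrary example densities

# Closed-form KL(N(0,1) || N(1,4)), converted from nats to bits, for reference.
kl_exact = 0.5 * (1 / 4 + 1 / 4 - 1 + np.log(4)) / np.log(2)

for eps in [0.5, 0.1, 0.01]:
    edges = np.arange(-15, 15 + eps, eps)
    P = np.diff(p.cdf(edges))                   # bin probabilities under p
    Q = np.diff(q.cdf(edges))                   # bin probabilities under q
    mask = P > 0
    kl_binned = np.sum(P[mask] * np.log2(P[mask] / Q[mask]))
    print(eps, kl_binned, kl_exact)             # the binned KL converges to the integral
```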

Now, we could think of differential entropy as the negative relative entropy between $p(x)$ and an unnormalized density $\lambda(x) = 1$,

$$-\int p(x) \log_2 p(x) \, dx = -D_\text{KL}[p \mid\mid \lambda].$$

Its interpretation would be the difference in the number of bits obtained by using $-\log_2 \int_{n\varepsilon}^{(n + 1)\varepsilon} p(x) \, dx$ bits to encode the $n$-th interval instead of $-\log_2 \varepsilon$ bits. Even though the former is optimal, this difference can now be negative, because $\lambda$ is cheating (by not integrating to 1) and can therefore assign fewer bits on average than is theoretically possible.
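A small sketch of that last point, assuming a uniform density on $[0, 1/4]$ chosen so the effect is visible: its differential entropy is $\log_2 \frac{1}{4} = -2$ bits, i.e. the code built from the binned $p$ uses two bits fewer per value than the $-\log_2 \varepsilon$ bits charged by $\lambda$.

```python
import numpy as np

# Example density: uniform on [0, 1/4], so p(x) = 4 there and 0 elsewhere.
a = 0.25
h_diff = np.log2(a)                        # differential entropy = -2 bits

eps = 1e-3
n_bins = int(a / eps)                      # 250 bins of width eps cover the support
probs = np.full(n_bins, 1.0 / n_bins)      # each bin has probability eps / a
bits_p = -np.sum(probs * np.log2(probs))   # bits used by the code built from p
bits_lambda = -np.log2(eps)                # bits per value charged by lambda(x) = 1
print(bits_p - bits_lambda, h_diff)        # both are -2: the "extra" bits are negative
```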

See Sergio Verdu's talk for a great introduction to relative entropy.
