[Math] Intrinsic significance of differential entropy

entropy, pr.probability

Many commentators (e.g. Jaynes, Rota) argue that the notion of "differential entropy" is problematic (as commonly defined by $ h(X) = \int ( \log\frac{1}{p(x)} ) p(x) \, dx $, where $X$ is a random variable with a probability density function $p$). Differential entropy has several often-mentioned deficiencies, in contrast to discrete entropy:

  • Its values are not always nonnegative (see the example after this list).
  • It is not invariant with respect to change of variables.
  • It assumes additional structure: a particular underlying measure, and the existence of the density $p$.
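To see the first two complaints in the simplest possible example (a standard computation, not tied to any particular source): if $X$ is uniform on $[0,a]$ then $$ h(X)=\int_0^a \frac1a \log a \, dx=\log a, $$ which is negative whenever $a<1$; and if $Y=cX$ with $c>0$, then $p_Y(y)=\frac1c p_X(y/c)$ and $$ h(Y)=h(X)+\log c, $$ so even a change of units changes the value. Discrete entropy has neither problem.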

These problems all essentially arise because:

  • It is not derived from information-theoretic first principles; it is merely defined by analogy to $\sum p_i \log\frac{1}{p_i}$.

This evidence suggests that differential entropy does not have much intrinsic significance, if any. In particular, ideas like the "principle of maximum entropy", whose foundations lie in the information-theoretic role of discrete entropy, should not be blindly generalized to differential entropy.

On the other hand, there are major results about differential entropy which seem to say the opposite. For example, it is well-known that, among all distributions with a particular mean and variance, the Gaussian distribution has the greatest differential entropy. Artstein, Ball, Barthe, and Naor have even proven that the differential entropy of a normalized sum of i.i.d. random variables is monotonically increasing, showing that the central limit theorem has behavior similar to the second law of thermodynamics. Maximum entropy ideas seem to carry over from discrete to differential entropy just fine!
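For concreteness, the relevant closed forms (standard, but worth double-checking against a reference): a Gaussian with variance $\sigma^2$ has $$ h\!\left(N(\mu,\sigma^2)\right)=\tfrac12\log\!\left(2\pi e\sigma^2\right), $$ while, for example, a uniform distribution with the same variance lives on an interval of length $L=\sqrt{12}\,\sigma$ and has $h=\log L=\tfrac12\log(12\sigma^2)$, which is smaller because $2\pi e\approx 17.1>12$. The maximum-entropy property is exactly this comparison, carried out over all densities with the given variance.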

Another example is a form of the uncertainty principle (from quantum mechanics and/or Fourier analysis) expressed in terms of differential entropy (see Wikipedia). This form is stronger than the traditional formulation involving standard deviation. Here, differential entropy appears to fill the role of a respectable measure of uncertainty.
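To make this concrete (in the quantum-mechanical normalization; the exact constant depends on the Fourier convention used): the Białynicki-Birula–Mycielski inequality states that for the position and momentum densities $|\psi(x)|^2$ and $|\tilde\psi(p)|^2$, $$ h_x+h_p \ \ge\ \log(\pi e\hbar), $$ with equality for Gaussians. Combined with the Gaussian maximum-entropy bounds $h_x\le\tfrac12\log(2\pi e\sigma_x^2)$ and $h_p\le\tfrac12\log(2\pi e\sigma_p^2)$, this gives $$ \log(2\pi e\,\sigma_x\sigma_p)\ \ge\ \log(\pi e\hbar) \quad\Longrightarrow\quad \sigma_x\sigma_p\ \ge\ \frac{\hbar}{2}, $$ so the entropic form really does imply, and strengthen, the usual Heisenberg inequality.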

Hence, confusion. How can I reconcile these conflicting bodies of evidence? Should the complaints about differential entropy be dismissed? Or are the nontrivial results about differential entropy somehow not meaningful, except perhaps as shadows of something more natural? Is differential entropy really conceptually significant?

Your insight is appreciated.

Best Answer

There have been several attempts to generalize the notion of entropy from discrete random variables to random variables with a general distribution function.

A straightforward way is to use a Riemann-sum discretization of the density: partition the range of $X$ into intervals of length $\delta$, which turns $X$ into a discrete random variable, and then let the intervals shrink. Denote the quantized random variable by $X_\delta$, where $\delta$ is the length of the intervals. If the probability density function $f$ is Riemann integrable, then for small $\delta$ (Cover and Thomas, p. 248): $$ H(X_\delta)\approx h(X)-\log\delta. $$ Choosing $\delta=2^{-n}$, i.e. $n$-bit quantization (with logarithms to base $2$), we get

$$ H(X_\delta)\approx h(X)+n, $$ which is (approximately) the number of bits needed to describe $X$ to $n$-bit accuracy. This makes the relation between differential entropy and discrete entropy explicit; note in particular that $H(X_\delta)\to\infty$ as $\delta\to 0$.
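As a quick sanity check, here is a small numerical sketch of this relation for a standard Gaussian; the grid range and the particular bit depths are my own arbitrary choices, not part of the construction:

```python
import numpy as np
from scipy.stats import norm

# Check H(X_delta) ~ h(X) + n for n-bit quantization of a standard Gaussian,
# whose differential entropy is 0.5*log2(2*pi*e), about 2.05 bits.

h_true = 0.5 * np.log2(2 * np.pi * np.e)

for n in (4, 8, 12):
    delta = 2.0 ** -n
    edges = np.arange(-10.0, 10.0 + delta, delta)   # bins of width delta
    p = np.diff(norm.cdf(edges))                    # probability mass per bin
    p = p[p > 0]                                    # drop numerically empty bins
    H_delta = -np.sum(p * np.log2(p))               # entropy of the quantized X
    print(n, round(H_delta, 3), round(h_true + n, 3))  # last two nearly agree
```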

Another point is that mutual information is preserved by this quantization in the limit: $$ \lim_{\delta\to 0} I(X_\delta;Y_\delta)=I(X;Y). $$
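A rough numerical illustration of this convergence for a standard bivariate Gaussian with correlation $\rho$, for which $I(X;Y)=-\tfrac12\log(1-\rho^2)$ nats; the grid range and the midpoint-rule approximation of the bin probabilities are my own shortcuts:

```python
import numpy as np

# Check that I(X_delta; Y_delta) approaches I(X;Y) as delta -> 0 for a
# standard bivariate Gaussian with correlation rho.

rho = 0.8
I_true = -0.5 * np.log(1 - rho**2)

for delta in (0.5, 0.1, 0.02):
    mid = np.arange(-8.0, 8.0, delta) + delta / 2            # bin midpoints
    X, Y = np.meshgrid(mid, mid, indexing="ij")
    const = 1.0 / (2 * np.pi * np.sqrt(1 - rho**2))
    pdf = const * np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2 * (1 - rho**2)))
    p_xy = pdf * delta**2                                     # joint bin masses
    p_x = p_xy.sum(axis=1, keepdims=True)                     # marginal of X_delta
    p_y = p_xy.sum(axis=0, keepdims=True)                     # marginal of Y_delta
    mask = p_xy > 0
    I_quant = np.sum(p_xy[mask] * np.log((p_xy / (p_x * p_y))[mask]))
    print(delta, round(I_quant, 4), round(I_true, 4))         # converges to I_true
```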

This generalization is attributed to several people, chiefly Kolmogorov and Rényi:

A. N. Kolmogorov. On the Shannon theory of information transmission in the case of continuous signals. IRE Trans. Inf. Theory, IT-2:102–108, Sept. 1956.

J. Balatoni and A. Rényi. Remarks on entropy (in Hungarian, with English and Russian summaries). Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 1:9–40, 1956.

Rényi introduced the following random variable ($[\cdot]$ denotes the integer part): $$ X_n=\frac1{n}[nX]. $$ This is nothing but quantizing $X$ with the intervals $[\frac kn,\frac{k+1}n)$. Suppose that $H([X])$ is finite (this is the assumption made in the original paper). The lower dimension of $X$ is defined as $$ \underline d(X)=\liminf_{n\to\infty}\frac{H(X_n)}{\log n} $$ and the upper dimension of $X$ as $$ \overline d(X)=\limsup_{n\to\infty}\frac{H(X_n)}{\log n}. $$ If $\overline d(X)=\underline d(X)$, we simply speak of the information dimension $d(X)$ of $X$ and define $$ H_{d(X)}(X)=\lim_{n\to\infty} \bigl(H(X_n)-d(X)\log n\bigr). $$

Rényi proved that if $X$ has an absolutely continuous distribution with density $f$ and $H([X])$ is finite, then $$ d(X)=1,\qquad H_1(X)=h(X). $$ This is exactly the approximation discussed above with $\delta=\frac 1n$: $$ H(X_n)\approx h(X)+\log n. $$

Kolmogorov instead introduced the notion of $\epsilon$-entropy, which is defined for random variables taking values in abstract metric spaces and is therefore more general.
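To make these definitions concrete, a quick worked example (easy to verify by hand): if $X$ is uniform on $[0,1]$, then $X_n=\frac1n[nX]$ is uniform on the $n$ points $\{0,\frac1n,\dots,\frac{n-1}n\}$, so $$ H(X_n)=\log n,\qquad d(X)=\lim_{n\to\infty}\frac{\log n}{\log n}=1,\qquad H_1(X)=\lim_{n\to\infty}\bigl(\log n-\log n\bigr)=0=h(X). $$ At the other extreme, a discrete $X$ with finitely many values has $H(X_n)\le H(X)$ bounded, so $d(X)=0$ and $H_0(X)=\lim_{n\to\infty} H(X_n)=H(X)$: the information dimension cleanly separates the discrete and absolutely continuous cases.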

To answer your question: we can keep the same intuition as in the discrete case for differential entropy, at least when it is used to compute mutual information or KL divergence, since such quantities pass through the quantization limit unchanged.
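One concrete aspect of that robustness is invariance under reparameterization. Here is a small sketch (my own toy example: two Gaussians pushed through the invertible map $y=e^x$) showing that the KL divergence is unchanged by an invertible change of variables, while each differential entropy individually is not:

```python
import numpy as np
from scipy.stats import norm, lognorm
from scipy.integrate import trapezoid

# Pushing two densities through y = exp(x) leaves D(p||q) unchanged, even
# though the individual differential entropies shift under the map.

p, q = norm(0, 1), norm(1, 1.5)

# KL divergence in the original coordinate, by trapezoidal integration.
x = np.linspace(-12, 12, 120_001)
d_x = trapezoid(p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), x)

# The same pair after y = exp(x): in scipy's parameterization, exp(N(mu, s^2))
# is lognorm(s=s, scale=exp(mu)).
pY, qY = lognorm(s=1.0, scale=1.0), lognorm(s=1.5, scale=np.e)
y = np.exp(x)                                    # image of the same grid
d_y = trapezoid(pY.pdf(y) * (pY.logpdf(y) - qY.logpdf(y)), y)

print(d_x, d_y)   # the two values agree up to the integration error
```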

For the entropy itself, we have to adjust our intuition a little. The entropy of a discrete random variable is the minimum number of bits needed to describe (compress) it. For a random variable with uncountable support that notion breaks down: any uncountable set can be mapped bijectively onto any other set of the same cardinality, so "compression" in that naive sense is meaningless. Yet different random variables with uncountable support can still have different differential entropies, and the quantization picture above explains what the difference measures: how many bits are needed to describe the variable to a given accuracy.
