Solved – Obtaining Shannon entropy from “KL-divergence to uniform distribution”

cross entropyentropyinformation theorykullback-leibler

Suppose I have a probability distribution $P$, and suppose that $U$ is the uniform distribution on the same sample space. Then the KL divergence from P to U is

$D_{KL}(U||P) = \sum_x u(x) \log\frac{u(x)}{p(x)}$

We can decompose this as follows:

$D_{KL}(U||P) = \sum_x u(x) \log u(x) – \sum_x u(x) \log p(x)$

The first term is the negative of the entropy of the uniform distribution, or $\log N$ where $N$ is the sample space. The second term is the cross entropy from P to U. So we get

$D_{KL}(U||P) = H(U||P) – H(U) = H(U||P) – \log(N)$

My question: given just the above quantities, is there some way to compute $H(P)$, the entropy of $P$?

Phrased another way: given the KL-divergence from some distribution $P$ to the uniform distribution, is the entropy of $P$ uniquely specified?

Best Answer

It seems the answer is no. According to your expression:

$$D_{KL} (u \parallel p) = -\log n - \frac{1}{n} \sum_x \log p(x)$$

The entropy of $p$ is

$$H(p) = -\sum_x p(x) \log p(x)$$

But, there's no way to recover this from the first expression. Given $D_{KL}(u \parallel p)$ (and no knowledge of $p$), the closest we could come is to add $\log n$ then multiply by $n$ to obtain $-\sum_x \log p(x)$. But, this sum collapses everything into a single value, and we cant recover the individual probabilities needed to compute the entropy.

But, what you want to do is possible if the KL divergence is computed the other way around:

$$D_{KL}(p \parallel u) = \sum_x p(x) \log p(x) - \sum_x p(x) \log u(x)$$

The first term is the negative entropy:

$$= -H(p) - \sum_x p(x) \log u(x)$$

Plug in the uniform probabilities:

$$= -H(p) - \sum_x p(x) \log \frac{1}{n}$$

Simplify the last term (noting that the distribution sums to one):

$$= -H(p) + \log n$$

Therefore:

$$H(p) = \log n - D_{KL}(p \parallel u)$$

Best Answer

Related Solutions

Why Use Kullback-Leibler Divergence Over Cross Entropy in t-SNE?

Neural Networks – Cross Entropy vs KL Divergence: What’s Minimized Directly in Practice?

Related Question