Kullback-Leibler Divergence – Why It Is Always Positive

entropy, information theory, intuition, kullback-leibler

I know there have been mathematical treatments of this question on here. What I'd like help with, though, is my intuitive understanding. Take the example given on Wikipedia:

$$\begin{array}{|c|c|c|c|} \hline
x&0&1&2\\ \hline
P(x)&0.36&0.48&0.16\\
\hline
Q(x)&0.33&0.33&0.33\\ \hline
\end{array}$$

Where $D_{KL}(P||Q) = 0.0852996$ and $D_{KL}(Q||P) = 0.097455$. On the one hand, I think I understand that information is gained in both cases because the distribution changes rather than remains the same (so some information has been gained about the likely value of $x$). But at the same time I can't shake the intuition that there should have been information loss for $D_{KL}(P||Q)$, because $Q(x)$ has greater entropy than $P(x)$. Can someone help to correct my intuitions? How is there information gain while entropy simultaneously increases?
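(For reference, the two divergence values above can be reproduced with a few lines of Python. This is only a sketch, using natural logarithms and taking $Q$ to be exactly uniform, i.e. $1/3$ each, as in the Wikipedia article.)

```python
import numpy as np

# P and Q from the table above (Q taken as exactly uniform, 1/3 each)
P = np.array([0.36, 0.48, 0.16])
Q = np.array([1/3, 1/3, 1/3])

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p||q) in nats (natural logarithm)."""
    return np.sum(p * np.log(p / q))

print(kl(P, Q))  # ~0.0853
print(kl(Q, P))  # ~0.0975
```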

Best Answer

Intuitive understanding is somewhat subjective, but I can at least offer my perspective:

Kullback-Leibler divergence is a concept from Information Theory. It tells you how much longer, on average, your messages are going to be (in bits) if you use a suboptimal coding scheme.

For every probability distribution, there is a lower bound on the average message length, and that is the entropy of the distribution. For the distribution $P$ from your Wikipedia example, it is

$$ - \sum_x P(x) \cdot \log_2 P(x) \approx 1.462 $$
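In code, that bound is simply the following (a minimal numpy sketch, using base-2 logarithms because we are counting bits):

```python
import numpy as np

P = np.array([0.36, 0.48, 0.16])

# Shannon entropy of P in bits: the lower bound on the average message length
print(-np.sum(P * np.log2(P)))  # ~1.462
```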

That is, if you were to record realisations of random variables from that probability distribution, e.g. in a computer file, or transmit them over a limited-bandwidth channel, you'd need, on average, at least $1.462$ bits per realisation, no matter how sophisticated your coding is. Since in that distribution the case $x = 1$ is three times as probable as $x = 2$, it makes sense to use a shorter code for encoding the event $x = 1$ than for encoding $x = 2$. You could, for example, use the following encoding:

   x:    0       1       2
code:   01       1     001

The average message length with this code is $1.68$ bits, which is (of course!) more than the theoretical lower bound, but still better than an equal-length code, e.g.:

   x:    0       1       2
code:   01      10      11

which would need $2$ bits per event. You can construct more complex codes to encode sequences of events, but no matter what you do, you won't be able to beat the information-theoretical lower bound.
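If you want to verify these averages, here is a minimal sketch; only the codeword lengths matter, and the lengths below are those of the two codes shown above.

```python
import numpy as np

P = np.array([0.36, 0.48, 0.16])    # true probabilities of x = 0, 1, 2
len_variable = np.array([2, 1, 3])  # lengths of the codewords 01, 1, 001
len_fixed = np.array([2, 2, 2])     # lengths of the codewords 01, 10, 11

print(np.sum(P * len_variable))     # 1.68 bits per event
print(np.sum(P * len_fixed))        # 2.0 bits per event
print(-np.sum(P * np.log2(P)))      # ~1.462 bits: the lower bound neither code reaches
```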

Now, for a different distribution, say $Q$, there are other encodings that approximate the best possible coding. The entropy of $Q$ from your example is $\approx 1.585$ bits. As approximations, both codes above are equally good, requiring on average $2$ bits per event, but more complex codes might do better.
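The same check under $Q$, with the same conventions as in the previous sketch:

```python
import numpy as np

Q = np.array([1/3, 1/3, 1/3])
len_variable = np.array([2, 1, 3])  # codewords 01, 1, 001
len_fixed = np.array([2, 2, 2])     # codewords 01, 10, 11

print(np.sum(Q * len_variable))     # 2.0 bits per event
print(np.sum(Q * len_fixed))        # 2.0 bits per event
print(-np.sum(Q * np.log2(Q)))      # ~1.585 bits: entropy of Q
```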

However, what is better for encoding $Q$ is not necessarily better for encoding $P$. Kullback-Leibler divergence tells you how many extra bits, on average, it costs you to use a coding optimised for transmitting/storing information on $Q$ if your true probability distribution is $P$. This measure cannot be negative. If it were, it would mean that you could beat the optimal coding for $P$ by using the coding optimised for $Q$ instead.
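To put a number on this: an idealised code optimised for $Q$ spends roughly $-\log_2 Q(x)$ bits on the event $x$ (ignoring the rounding to whole-bit codewords). The average of that cost under the true distribution $P$, minus the entropy of $P$, is exactly $D_{KL}(P||Q)$ expressed in bits; the $0.0852996$ from the question is the same quantity in nats, i.e. computed with natural logarithms. A minimal sketch:

```python
import numpy as np

P = np.array([0.36, 0.48, 0.16])
Q = np.array([1/3, 1/3, 1/3])

# Idealised codeword lengths: a code optimised for Q spends about -log2 Q(x) bits on x
cost_with_Q_code = -np.sum(P * np.log2(Q))  # average bits/event if the data follow P
cost_with_P_code = -np.sum(P * np.log2(P))  # entropy of P: the best you can do

print(cost_with_Q_code - cost_with_P_code)  # ~0.123 bits of overhead
print(np.sum(P * np.log2(P / Q)))           # the same number: D_KL(P||Q) in bits
```

Multiplying those $\approx 0.123$ bits by $\ln 2 \approx 0.693$ recovers the $\approx 0.0853$ nats quoted in the question.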

Indeed, the KL divergence $D_{KL}(P||P) = 0$ (easy to show, because $\log(P(x)/P(x)) = \log(1) = 0$) tells you that encoding the probability distribution $P$ with a code optimised for that very distribution incurs zero extra cost.
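As a last sanity check, here is a small sketch; the randomly drawn $Q$'s merely illustrate the non-negativity, they are not a proof of it.

```python
import numpy as np

P = np.array([0.36, 0.48, 0.16])
rng = np.random.default_rng(0)

def kl_bits(p, q):
    """D_KL(p||q) in bits."""
    return np.sum(p * np.log2(p / q))

print(kl_bits(P, P))  # 0.0: the code optimised for P has no overhead on P

# The overhead of a code optimised for any other Q is never negative
for _ in range(5):
    Q = rng.dirichlet(np.ones(3))  # a random distribution over three outcomes
    print(kl_bits(P, Q) >= 0)      # True every time
```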