Solved – Kullback-Leibler divergence WITHOUT information theory

compression, entropy, inference, information theory, kullback-leibler

After much trawling of Cross Validated, I still don't feel I'm any closer to understanding KL divergence outside of the realm of information theory. It's rather odd, as somebody with a maths background, to find the information theory explanation so much easier to understand.

To outline my understanding from an information theory background:
If we have a random variable with a finite number of outcomes, there exists an optimal encoding which allows us to communicate the outcome to somebody else with, on average, the shortest message (I find this easiest to picture in terms of bits). The expected length of the message needed to communicate the outcome is given by $$ -\sum _{\alpha}p_{\alpha}\log_{2}(p_{\alpha})$$ if the optimal encoding is used. If you were to use a suboptimal encoding, then KL divergence tells us on average how much longer our messages would be.
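
To make that concrete, here is a minimal Python sketch (not part of the original question; the coin probabilities are purely illustrative) computing this expected message length for a loaded coin:

```python
import numpy as np

# Hypothetical loaded coin: P(heads) = 0.9, P(tails) = 0.1 (illustrative values only).
p = np.array([0.9, 0.1])

# Expected message length in bits under the optimal encoding for p,
# i.e. the entropy  -sum_a p_a * log2(p_a).
entropy_p = -np.sum(p * np.log2(p))
print(f"expected message length: {entropy_p:.3f} bits")  # roughly 0.47 bits
```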

I like this explanation, because it deals quite intuitively with the asymmetry of KL divergence. If we have two different systems, e.g. two differently loaded coins, they will have different optimal encodings. I don't instinctively feel that using the second system's encoding for the first is "equally bad" as using the first system's encoding for the second. Without going through the thought process of how I convinced myself, I'm now fairly happy that $$\sum _{\alpha}p_{\alpha}( \log _{2}p_{\alpha}-\log_{2}q_{\alpha})$$ gives you this "extra expected message length" when using $q$'s encoding for $p$.
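
A small sketch (again with made-up coin biases) shows the asymmetry numerically: encoding the first coin's outcomes with the second coin's code costs a different amount of extra length than the reverse.

```python
import numpy as np

def kl_bits(p, q):
    """Extra expected message length (in bits) when outcomes drawn from p
    are encoded with the code that is optimal for q:
        sum_a p_a * (log2(p_a) - log2(q_a))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * (np.log2(p) - np.log2(q)))

# Two differently loaded coins (illustrative values).
p = [0.9, 0.1]
q = [0.6, 0.4]

print(kl_bits(p, q))  # ~0.33 bits: p's outcomes sent with q's code
print(kl_bits(q, p))  # ~0.45 bits: q's outcomes sent with p's code -- not the same
```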

However, most definitions of KL divergence, including Wikipedia's, then state (keeping things discrete so that they can be compared to the information theory interpretation, which works far better in discrete terms since bits are discrete) that if we have two discrete probability distributions, KL divergence provides some measure of "how different they are". I have yet to see a single explanation of how these two concepts are even related. I seem to remember that in his book on inference, David MacKay makes the point that data compression and inference are basically the same thing, and I suspect my question is really related to this.

Regardless of whether it is or it isn't, the sort of question I have in mind is around problems of inference. (Keeping things discrete:) if we have two radioactive samples, and we know that one of them is a certain material with known radioactivity (this is dubious physics, but let's pretend the universe works like that), so that we know the "true" distribution of radioactive clicks we should measure is Poisson with known $\lambda$, is it fair to build up an empirical distribution for each sample, compare their KL divergences to the known distribution, and say the sample with the lower divergence is more likely to be that material?
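
For concreteness, here is a minimal sketch of the comparison I have in mind (the rates, sample sizes, and names are invented for illustration, and the same computation would apply to the biased-sample question below):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
lam = 4.0                                    # assumed known rate of the reference material
counts_a = rng.poisson(lam, size=500)        # sample that really follows Poisson(lam)
counts_b = rng.poisson(lam * 1.5, size=500)  # sample with a different rate

def kl_empirical_vs_poisson(counts, lam):
    """KL( empirical distribution || Poisson(lam) ), in nats.
    Terms where the empirical probability is zero contribute nothing,
    so summing over the observed support is enough."""
    values, freq = np.unique(counts, return_counts=True)
    p_hat = freq / freq.sum()                # empirical distribution of the counts
    q = poisson.pmf(values, lam)             # reference probabilities on the same support
    return np.sum(p_hat * np.log(p_hat / q))

print(kl_empirical_vs_poisson(counts_a, lam))  # small
print(kl_empirical_vs_poisson(counts_b, lam))  # noticeably larger
```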

Moving away from dubious physics, if I know two samples are drawn from the same distribution, but I know they're not randomly selected, would comparing their KL divergences to the known global distribution give me a feel for "how biased" the samples are, relative to one another anyway?

And finally, if the answer to the previous questions is yes, then why? Is it possible to understand these things from a statistical point of view alone without making any (possibly tenuous) connections to information theory?

Best Answer

There is a purely statistical approach to the Kullback-Leibler divergence: take an iid sample $X_1,\ldots,X_n$ from an unknown distribution $p^\star$ and consider the potential fit by a family of distributions $$\mathfrak{F}=\{p_\theta\,,\ \theta\in\Theta\}\,.$$ The corresponding likelihood is defined as $$L(\theta|x_1,\ldots,x_n)=\prod_{i=1}^n p_\theta(x_i)$$ and its logarithm is $$\ell(\theta|x_1,\ldots,x_n)=\sum_{i=1}^n \log p_\theta(x_i)\,.$$ By the Law of Large Numbers, $$\frac{1}{n} \ell(\theta|X_1,\ldots,X_n) \longrightarrow \mathbb{E}[\log p_\theta(X)]=\int \log p_\theta(x)\,p^\star(x)\,\text{d}x\,,$$ which is (up to its sign) the interesting part of the Kullback-Leibler divergence between $p_\theta$ and $p^\star$, $$\mathfrak{H}(p_\theta|p^\star)\stackrel{\text{def}}{=}\int \log \{p^\star(x)/p_\theta(x)\}\,p^\star(x)\,\text{d}x\,,$$ the other part, $$\int \log \{p^\star(x)\}\,p^\star(x)\,\text{d}x\,,$$ being there to make the minimum [in $\theta$] of $\mathfrak{H}(p_\theta|p^\star)$ equal to zero (when $p^\star$ belongs to the family). Hence, maximising the likelihood is asymptotically equivalent to minimising the Kullback-Leibler divergence between $p_\theta$ and $p^\star$.
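
A small numerical illustration of this argument (not part of the answer; it uses a Bernoulli family and invented values purely for concreteness): the $\theta$ that maximises the average log-likelihood is, up to sampling noise, the $\theta$ that minimises $\mathfrak{H}(p_\theta|p^\star)$.

```python
import numpy as np

rng = np.random.default_rng(1)
p_star = 0.3                               # true success probability (illustrative)
x = rng.binomial(1, p_star, size=10_000)   # iid sample from p*

thetas = np.linspace(0.01, 0.99, 99)       # candidate Bernoulli(theta) models

# Average log-likelihood (1/n) * l(theta | x_1, ..., x_n) for each theta.
avg_loglik = np.array([np.mean(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

# KL( p* || p_theta ) for the Bernoulli family, computed from the true p*.
kl = p_star * np.log(p_star / thetas) + (1 - p_star) * np.log((1 - p_star) / (1 - thetas))

print(thetas[np.argmax(avg_loglik)])  # close to 0.30: maximises the average log-likelihood
print(thetas[np.argmin(kl)])          # 0.30: minimises the KL divergence
```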

A book that connects divergence, information theory and statistical inference is Rissanen's Optimal Estimation of Parameters, which I reviewed here.