Solved – the information-theoretic nature of the Kullback-Leibler divergence

information-theory, kullback-leibler, mathematical-statistics, references

Statistical references often present the Kullback-Leibler and $f$-divergences as being information-theoretic in nature. Some examples:

Yet my understanding of the KL divergence is entirely statistical: it is an expected weight of evidence, it is invariant under sufficient statistics, it (and other $f$-divergences) relates to the asymptotics of the likelihood ratio, etc.

In what sense is the Kullback-Leibler divergence information-theoretic in nature? How are statistical arguments based on the Kullback-Leibler divergence of an information-theoretic flavor?

Notes: I'm looking for a precise answer that would rigorously justify the use of "information theory" in the titles above, or argue that it is only a buzzword. I'm considering general infinite sample spaces.

See also: Kullback-Leibler divergence WITHOUT information theory.

Best Answer

This answer attempts to explain (1) the information-theoretic interpretation of the KL divergence and (2) how that interpretation lends itself to Bayesian analysis. What follows is quoted directly from pp. 148-150, Section 6.6, of Stone's Information Theory, a very good book which I recommend.

Kullback-Leibler divergence (KL-divergence) is a general measure of the difference between two distributions, and is also known as the relative entropy. Given two distributions $p(X)$ and $q(X)$ of the same variable $X$, the KL-divergence between these distributions is $$ D_{KL}(p(X)||q(X))=\int_x p(x) \log\frac{p(x)}{q(x)}dx \,.$$ KL-divergence is not a true measure of distance because, usually $$D_{KL}(p(X)||q(X)) \not= D_{KL}(q(X)||p(X)) \,.$$ Note that $D_{KL}(p(X)||q(X))>0$, unless $p=q$, in which case it is equal to zero.
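As a concrete illustration of these properties (not part of Stone's text), here is a minimal Python sketch; the distributions `p` and `q` are arbitrary made-up examples over a three-point support.

```python
# Minimal sketch (illustrative only): KL divergence between two discrete
# distributions over the same support, in nats.
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]   # made-up example distributions
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # generally a different value: KL is asymmetric
print(kl_divergence(p, p))  # exactly zero when the distributions coincide
```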

The KL-divergence between the joint distribution $p(X,Y)$ and the joint distribution $[p(X)p(Y)]$ obtained from the outer product of the marginal distributions $p(X)$ and $p(Y)$ is $$D_{KL}(p(X,Y)||[p(X)p(Y)])=\int_x \int_y p(x,y) \log\frac{p(x,y)}{p(x)p(y)} dy dx $$ which we can recognize from Equation 6.25 $$I(X,Y) = \int_y\int_x p(x,y)\log\frac{p(x,y)}{p(x)p(y)}dx dy $$ as the mutual information between $X$ and $Y$.

Thus the mutual information between $X$ and $Y$ is the KL-divergence between the joint distribution $p(X,Y)$ and the joint distribution $[p(X)p(Y)]$ obtained by evaluating the outer product of the marginal distributions of $p(X)$ and $p(Y)$.
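To make this identification concrete, the following sketch (the joint table is invented for illustration) computes the mutual information of a discrete joint distribution as the KL-divergence between the joint and the outer product of its marginals.

```python
# Hedged sketch: I(X;Y) as D_KL(p(X,Y) || p(X)p(Y)) for a discrete joint.
import numpy as np

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x)p(y)) ), in nats."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y)
    outer = px * py                        # outer product p(x)p(y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / outer[mask]))

joint = np.array([[0.30, 0.10],   # made-up p(x, y); rows index x, columns index y
                  [0.10, 0.50]])
print(mutual_information(joint))  # > 0 because X and Y are dependent here
```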

Bayes' Rule

We can express the KL-divergence between two variables in terms of Bayes' rule (see Stone (2013)$^{52}$ and Appendix F). Given that $p(x,y)=p(x|y)p(y)$, mutual information can be expressed as $$I(X,Y) = \int_y p(y) \int_x p(x|y)\log\frac{p(x|y)}{p(x)}dx dy \,, $$ where the inner integral can be recognized as the KL-divergence between the distributions $p(X|y)$ and $p(X)$, $$D_{KL}(p(X|y)||p(X))=\int_x p(x|y)\log\frac{p(x|y)}{p(x)}dx \,, $$ where $p(X|y)$ is the posterior distribution and $p(X)$ is the prior distribution. Thus, the mutual information between $X$ and $Y$ is $$I(X,Y) = \int_y p(y) D_{KL}(p(X|y)||p(X))dy \,, $$ which is the expected KL-divergence between the posterior and the prior, $$I(X,Y) = \mathbb{E}_y [D_{KL}(p(X|y)||p(X))]\,, $$ where the expectation is taken over values of $Y$.
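A quick numerical check of this decomposition (using the same made-up joint distribution as above): computing $I(X,Y)$ directly from the joint and computing it as the expected KL-divergence between the posterior $p(X|y)$ and the prior $p(X)$ give the same value.

```python
# Sketch: mutual information computed two ways for an illustrative discrete joint.
import numpy as np

joint = np.array([[0.30, 0.10],   # p(x, y); rows index x, columns index y
                  [0.10, 0.50]])
px = joint.sum(axis=1)            # prior p(x)
py = joint.sum(axis=0)            # marginal p(y)

# Direct definition: I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x)p(y)) )
I_direct = np.sum(joint * np.log(joint / np.outer(px, py)))

# Bayes'-rule form: I(X;Y) = sum_y p(y) * D_KL( p(X|y) || p(X) )
I_bayes = 0.0
for j, pyj in enumerate(py):
    posterior = joint[:, j] / pyj                       # p(x | y = y_j)
    I_bayes += pyj * np.sum(posterior * np.log(posterior / px))

print(I_direct, I_bayes)          # the two expressions agree
```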

The application to Bayesian analysis can be found in Appendix H, pp. 157-158, of Stone's Bayes' Rule, which is also a very good book.

Reference Priors

The question of what constitutes an unbiased or fair prior has several answers. Here, we provide a brief account of the answer given by Bernardo (1979)$^3$, who called such priors reference priors.

Reference priors rely on the idea of mutual information. In essence, the mutual information between two variables is a measure of how tightly coupled they are, and can be considered to be a general measure of the correlation between variables. More formally, it is the average amount of Shannon information conveyed about one variable by the other variable. For our purposes, we note that the mutual information $I(x,\theta)$ between $x$ and $\theta$ is also the average difference between the posterior $p(\theta|x)$ and the prior $p(\theta)$, where this difference is measured as the Kullback-Leibler divergence. A reference prior is defined as that particular prior which makes the mutual information between $x$ and $\theta$ as large as possible, and (equivalently) maximizes the average Kullback-Leibler divergence between the posterior and the prior.
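As a toy illustration of the maximization idea (a sketch only, not Bernardo's general construction), consider a discrete location-type model $x = (\theta + \epsilon) \bmod N$, made cyclic so the toy channel is translation-symmetric and free of boundary effects; the noise distribution and the candidate priors below are chosen arbitrarily. Scanning the candidate priors and computing $I(x,\theta)$ (equivalently, the expected KL-divergence between posterior and prior) shows the uniform prior giving the largest value in this family.

```python
# Toy illustration (assumed model, not from the source): mutual information
# between data x and a cyclic location parameter theta under different priors.
import numpy as np

N = 5
noise = np.array([0.6, 0.3, 0.1, 0.0, 0.0])                   # assumed noise distribution
likelihood = np.array([np.roll(noise, t) for t in range(N)])  # p(x | theta = t), one row per t

def mutual_information(prior):
    joint = prior[:, None] * likelihood                       # p(theta, x)
    px = joint.sum(axis=0)                                    # marginal p(x)
    mask = joint > 0
    ratio = joint[mask] / (prior[:, None] * px[None, :])[mask]
    return np.sum(joint[mask] * np.log(ratio))

uniform = np.full(N, 1.0 / N)
skewed = np.array([0.40, 0.30, 0.15, 0.10, 0.05])             # an arbitrary alternative prior

print(mutual_information(uniform))   # ~0.71 nats
print(mutual_information(skewed))    # ~0.61 nats: smaller than under the uniform prior
```

The wrap-around arithmetic is purely an assumption of this sketch; it stands in for the translation invariance of a genuine location model.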

What has this to do with fair priors? A defining, and useful, feature of mutual information is that it is immune or invariant to the effects of transformations of variables. For example, if a measurement device adds a constant amount $k$ to each reading, so that we measure $x$ as $y=x+k$, then the mean $\theta$ becomes $\phi=\theta+k$, where $\theta$ and $\phi$ are location parameters. Despite the addition of $k$ to measured values, the mutual information between $\phi$ and $y$ remains the same as the mutual information between $\theta$ and $x$; that is, $I(y,\phi) = I(x,\theta)$. Thus, the fairness of a prior (defined in terms of transformation invariance) is guaranteed if we choose a common prior for $\theta$ and $\phi$ which ensures that $I(y,\phi) = I(x,\theta)$. Indeed, it is possible to harness this equality to derive priors which have precisely the desired invariance. It can be shown that the only prior that satisfies this equality for a location parameter (such as the mean) is the uniform prior...
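As a hedged numerical sketch of this invariance (the Gaussian model and the constant $k$ below are assumptions chosen for illustration), take $\theta \sim N(0,\tau^2)$ and $x \mid \theta \sim N(\theta,\sigma^2)$, so that $(\theta, x)$ is jointly Gaussian and $I(x,\theta) = -\tfrac{1}{2}\log(1-\rho^2)$ with $\rho$ the correlation; adding the same constant $k$ to both leaves the correlation, and hence the mutual information, unchanged.

```python
# Sketch (assumed Gaussian location model): mutual information is unchanged
# when a constant k is added to both the measurement and the location parameter.
import numpy as np

rng = np.random.default_rng(0)
tau, sigma, k = 2.0, 1.0, 7.0                          # illustrative values

theta = rng.normal(0.0, tau, size=100_000)             # prior draws, theta ~ N(0, tau^2)
x = theta + rng.normal(0.0, sigma, size=theta.shape)   # measurements, x | theta ~ N(theta, sigma^2)
phi, y = theta + k, x + k                              # shifted parameter and measurement

def gaussian_mi(a, b):
    """Mutual information (nats) of a jointly Gaussian pair via its correlation."""
    rho = np.corrcoef(a, b)[0, 1]
    return -0.5 * np.log(1.0 - rho**2)

print(gaussian_mi(x, theta))   # close to 0.5*log(1 + tau^2/sigma^2), about 0.80 nats here
print(gaussian_mi(y, phi))     # identical: the shift changes no variances or covariances
```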