[Math] What characterizations of relative information are known

it.information-theory

Given two probability distributions $p,q$ on a finite set $X$, the quantity variously known as relative information, relative entropy, information gain or Kullback–Leibler divergence is defined to be

$$ D_{KL}(p\|q) = \sum_{i \in X} p_i \ln\left(\frac{p_i}{q_i}\right) $$

This is a natural generalization of the Shannon information of a single probability measure.
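For concreteness, here is a minimal sketch (in Python, with made-up distributions $p$ and $q$) that computes the sum above directly from the definition; the convention $0\ln 0 = 0$ is handled by skipping terms with $p_i = 0$, and it is assumed that $q_i > 0$ wherever $p_i > 0$.

```python
# Minimal sketch: D_KL(p || q) for two distributions on a finite set,
# computed directly from the definition. p and q are made-up examples.
import math

def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))   # strictly positive, since p != q
print(kl_divergence(p, p))   # 0.0
```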

I wrote a paper with Tobias Fritz giving a category-theoretic characterization of this quantity.

Belatedly I'm wondering what other characterizations are known. On Wikipedia I read:

Arthur Hobson proved that the Kullback–Leibler divergence is the only measure of difference between probability distributions that satisfies some desiderata, which are the canonical extension to those appearing in a commonly used characterization of entropy.

Unfortunately they don't give these desiderata, and there are lots of different characterizations of entropy, so I don't know which one is meant here. I also don't have quick access to Hobson's book:

  • Arthur Hobson, Concepts in Statistical Mechanics, Gordon and Breach, New York, 1971.

Wikipedia also gives a characterization in terms of coding theory:

The Kullback–Leibler divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution $q$ is used, compared to using a code based on the true distribution $p$.

This is made precise in Theorem 5.4.3 here:

  • T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd Edition, Wiley-Interscience, New York, 2006.
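As a numerical illustration of this coding interpretation (a sketch with an invented pair of distributions, not the argument from Cover and Thomas): an idealized code optimal for $q$ assigns length $-\log_2 q_i$ to symbol $i$; its expected length under the true distribution $p$ is the cross entropy, which exceeds the entropy $H(p)$ by exactly $D_{KL}(p\|q)$ measured in bits.

```python
# Sketch of the coding interpretation with an invented example: the expected
# length of a code optimal for q, evaluated under the true distribution p,
# exceeds the entropy H(p) by exactly D_KL(p || q) in bits.
import math

p = [0.5, 0.25, 0.25]   # true distribution (hypothetical example)
q = [0.25, 0.25, 0.5]   # wrong distribution the code was designed for

entropy       = -sum(pi * math.log2(pi) for pi in p)              # H(p)
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))  # expected code length
kl_bits       =  sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(cross_entropy - entropy)  # expected extra bits per symbol: 0.25
print(kl_bits)                  # equal to the line above
```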

What other characterizations of relative entropy are known?

Best Answer

Maximum likelihood estimation:

Let $X_1,\dots,X_n$ be independent and identically distributed observations from a distribution modeled by the parametric family $\mathcal{F} = \{P_{\theta}:\theta\in\Theta\}$, and suppose that all the distributions in $\mathcal{F}$ have a common finite support set $\mathcal{X}$. The maximum likelihood estimate (MLE) is the distribution $P_{\theta}$ that maximizes $\prod_{i=1}^n P_{\theta}(X_i)$. Let $\hat{P}$ be the empirical distribution of the observations. Then
\begin{align}
\frac{\prod_{i=1}^n P_{\theta}(X_i)}{\prod_{i=1}^n \hat P(X_i)} & = \prod_{x\in\mathcal{X}} \left(\frac{P_{\theta}(x)}{\hat P(x)}\right)^{n\hat P(x)}\\
& = \exp\left\{n\sum_{x\in\mathcal{X}}\hat P(x)\log\left(\frac{P_{\theta}(x)}{\hat P(x)}\right)\right\}\\
& = \exp\left\{-n\,D(\hat P\|P_{\theta})\right\}.
\end{align}

Thus maximizing $\prod_{i=1}^n P_{\theta}(X_i)$ is the same as minimizing $D(\hat P\|P_{\theta})$.
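A quick numerical check of this identity (a sketch with a made-up sample and a made-up candidate model $P_{\theta}$, not taken from the Csiszár–Shields tutorial): the log of the likelihood ratio against the empirical distribution equals $-n\,D(\hat P\|P_{\theta})$.

```python
# Sketch verifying the identity above on made-up data: for a candidate model
# P_theta and the empirical distribution P_hat of an i.i.d. sample, the log of
# prod P_theta(X_i) / prod P_hat(X_i) equals -n * D(P_hat || P_theta).
import math
from collections import Counter

sample = ['a', 'b', 'a', 'a', 'c', 'b', 'a', 'c']   # hypothetical observations
n = len(sample)
support = ['a', 'b', 'c']
p_theta = {'a': 0.5, 'b': 0.3, 'c': 0.2}            # a candidate model P_theta

counts = Counter(sample)
p_hat = {x: counts[x] / n for x in support}         # empirical distribution

log_lik_theta = sum(math.log(p_theta[x]) for x in sample)
log_lik_hat   = sum(math.log(p_hat[x]) for x in sample)
kl = sum(p_hat[x] * math.log(p_hat[x] / p_theta[x]) for x in support if p_hat[x] > 0)

print(log_lik_theta - log_lik_hat)  # log of the likelihood ratio
print(-n * kl)                      # equal to the line above
```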

Source: I. Csiszár and P. C. Shields, "Information Theory and Statistics: A Tutorial," Foundations and Trends in Communications and Information Theory, 2004.
