[Math] The motivation of the Kullback-Leibler divergence

information theory

The Kullback-Leibler Divergence is defined as $$K(f:g) = \int \left(\log \frac{f(x)}{g(x)} \right) \ dF(x)$$

It is often used as a measure of the discrepancy between two distributions $f$ and $g$, although it is not symmetric and hence not a distance in the metric sense. Why would it be better than the Euclidean distance in some situations?
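For concreteness, here is a minimal numerical sketch of the definition on a three-point space (the probability vectors and the helper name `kl` are illustrative choices, not part of the question); it also shows that, unlike the Euclidean distance, the divergence is not symmetric in $f$ and $g$.

```python
import numpy as np

def kl(f, g):
    """Discrete analogue of K(f:g) = sum_x f(x) log(f(x) / g(x)).

    Terms with f(x) = 0 contribute 0; assumes g(x) > 0 wherever f(x) > 0.
    """
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    m = f > 0
    return float(np.sum(f[m] * np.log(f[m] / g[m])))

f = np.array([0.7, 0.2, 0.1])
g = np.array([0.5, 0.3, 0.2])

print(kl(f, g))               # K(f:g) ~= 0.085
print(kl(g, f))               # K(g:f) ~= 0.092 -- not the same: K is not symmetric
print(np.linalg.norm(f - g))  # Euclidean distance ~= 0.245, symmetric by construction
```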

Best Answer

The short answer is that KL divergence has a probabilistic/statistical meaning (many of them, in fact) while the Euclidean distance does not. For example, a given difference $f(x)-g(x)$ means something entirely different depending on the absolute sizes of $f(x)$ and $g(x)$.
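To make this concrete, here is a small Python sketch (the probability vectors are illustrative choices): two perturbations of the same $f$ that are equally far from $f$ in Euclidean distance, yet whose KL divergences differ by nearly two orders of magnitude, because one of them shrinks an already tiny probability.

```python
import numpy as np

def kl(f, g):
    """Discrete K(f:g) = sum_x f(x) log(f(x) / g(x)); assumes g > 0 wherever f > 0."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    m = f > 0
    return float(np.sum(f[m] * np.log(f[m] / g[m])))

f       = np.array([0.500, 0.490, 0.010])
g_large = np.array([0.491, 0.499, 0.010])  # shift 0.009 between the two large masses
g_small = np.array([0.509, 0.490, 0.001])  # shift the same 0.009 out of the tiny mass

# Identical Euclidean distances from f ...
print(np.linalg.norm(f - g_large), np.linalg.norm(f - g_small))   # both ~= 0.0127

# ... but wildly different KL divergences: losing 90% of a small probability
# is far more consequential than moving 0.009 between two large probabilities.
print(kl(f, g_large), kl(f, g_small))                             # ~= 0.00016 vs ~= 0.014
```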

The Wikipedia page on the subject is a must-read, naturally. Let me explain only one interpretation of KL divergence. Assume a random i.i.d. sample $\mathfrak X=(x_k)_{1\leqslant k\leqslant n}$ follows the distribution $f$ and a random i.i.d. sample $\mathfrak Y=(y_k)_{1\leqslant k\leqslant n}$ follows the distribution $g$. A way to distinguish $\mathfrak X$ from $\mathfrak Y$ is to ask for the likelihood that $\mathfrak Y$ behaves like $\mathfrak X$, that is, that $\mathfrak Y$ behaves like a typical sample from $f$.

More precisely, one wants to estimate how unlikely $\mathfrak Y$ becomes when one asks that $\mathfrak Y$ behaves like an $f$ sample, compared to its ordinary likelihood as a $g$ sample.

The computation is rather simple and based on the following. Assume that $N(x,x+\mathrm dx)$ values from the sample fall in each interval $(x,x+\mathrm dx)$. Then the likelihood of any single configuration with these occupation numbers scales like $$ \prod g(x)^{N(x,x+\mathrm dx)}=\exp\left(\sum N(x,x+\mathrm dx)\log g(x)\right), $$ and this has to be multiplied by the number of configurations realizing these occupation numbers, which, by Stirling's formula, scales like $$ \exp\left(-\sum N(x,x+\mathrm dx)\log\frac{N(x,x+\mathrm dx)}{n}\right). $$

For a typical $f$ sample, $N(x,x+\mathrm dx)\approx nf(x)\mathrm dx$ when $n\to\infty$, for every $x$, hence the likelihood of $\mathfrak Y$ masquerading as an $f$ sample scales like $$ \ell_n(f\mid g)\approx\exp\left(n\int f(x)\log g(x)\,\mathrm dx-n\int f(x)\log f(x)\,\mathrm dx\right), $$ up to a factor, coming from the bin widths $\mathrm dx$, which is the same in the two cases compared here and therefore cancels from the final ratio. On the other hand, for a typical $g$ sample, $N(x,x+\mathrm dx)\approx ng(x)\mathrm dx$ when $n\to\infty$, for every $x$, hence the likelihood of $\mathfrak Y$ behaving like a typical $g$ sample scales like $$ \ell_n(g\mid g)\approx\exp\left(n\int g(x)\log g(x)\,\mathrm dx-n\int g(x)\log g(x)\,\mathrm dx\right)=1. $$

Thus $\ell_n(f\mid g)\ll\ell_n(g\mid g)$, as was to be expected, and the ratio $\dfrac{\ell_n(f\mid g)}{\ell_n(g\mid g)}$ decreases exponentially fast when $n\to\infty$, approximately like $\mathrm e^{-nH}$, where $$ H=\int f(x)\log f(x)\,\mathrm dx-\int f(x)\log g(x)\,\mathrm dx=K(f\mid g). $$
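The exponential rate $H$ can also be checked by a law-of-large-numbers computation: for a typical $f$ sample, $\frac1n\sum_k\log\frac{f(x_k)}{g(x_k)}\to\int f\log f-\int f\log g=K(f\mid g)$, so the likelihood ratio itself behaves like $\mathrm e^{nK(f\mid g)}$. Below is a minimal Monte Carlo sketch of this in Python; the Gaussian choice $f=N(0,1)$, $g=N(1,1)$ and the closed-form value $K(f\mid g)=\tfrac12$ are illustrative assumptions, not part of the answer above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative choice: f = N(0, 1), g = N(1, 1).
# For equal unit variances, K(f|g) = (mu_f - mu_g)^2 / 2 = 0.5.
mu_f, mu_g = 0.0, 1.0
kl_exact = (mu_f - mu_g) ** 2 / 2

for n in (10, 100, 1_000, 10_000, 100_000):
    x = rng.normal(mu_f, 1.0, size=n)   # a typical f sample of size n
    # Per-observation log-likelihood ratio log f(x_k) - log g(x_k); its average
    # converges to K(f|g) as n grows, which is the exponential rate H above.
    log_ratio = norm.logpdf(x, loc=mu_f, scale=1.0) - norm.logpdf(x, loc=mu_g, scale=1.0)
    print(f"n = {n:6d}   empirical rate = {log_ratio.mean():.4f}   exact K(f|g) = {kl_exact}")
```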
