An Adaptation of the Kullback-Leibler Distance


Look at this picture:

[Figure: two probability density curves, one red and one blue, with similar shapes; the red density places some mass below 0.25 while the blue one places none there.]

If we draw a sample from the red density, some values are expected to be less than 0.25, whereas such values are impossible to generate from the blue distribution. As a consequence, the Kullback-Leibler distance from the red density to the blue density is infinite. However, the two curves are not that distinct in some "natural" sense.

Here is my question: does there exist an adaptation of the Kullback-Leibler distance that would give a finite distance between these two curves?
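
To make the issue concrete, here is a toy discrete version of it (the numbers are purely illustrative and my own; `scipy.stats.entropy` returns the KL divergence when given two distributions):

```python
# Toy illustration: "red" puts a little mass on an outcome that "blue" assigns
# probability zero, so KL(red || blue) is infinite even though the two
# distributions are otherwise nearly identical.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

red  = np.array([0.05, 0.45, 0.45, 0.05])   # small mass on the first outcome
blue = np.array([0.00, 0.50, 0.45, 0.05])   # no mass on the first outcome

print(entropy(red, blue))   # inf
print(entropy(blue, red))   # finite, since blue's support is contained in red's
```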

Best Answer

You might look at Chapter 3 of Devroye, Gyorfi, and Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996. See, in particular, the section on $f$-divergences.

$f$-Divergences can be viewed as a generalization of Kullback--Leibler (or, alternatively, KL can be viewed as a special case of an $f$-Divergence).

The general form is $$ D_f(p, q) = \int q(x) f\left(\frac{p(x)}{q(x)}\right) \, \lambda(dx) , $$

where $\lambda$ is a measure that dominates the measures associated with $p$ and $q$ and $f(\cdot)$ is a convex function satisfying $f(1) = 0$. (If $p(x)$ and $q(x)$ are densities with respect to Lebesgue measure, just substitute the notation $dx$ for $\lambda(dx)$ and you're good to go.)
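
As a rough numerical sketch of this definition (my own, not notation from the book), here is how one might approximate $D_f$ for two densities tabulated on a common grid, with $\lambda$ taken to be Lebesgue measure. The function name `f_divergence`, the grid discretization, and the `f_slope_at_inf` argument for handling points where $q = 0$ are all assumptions on my part:

```python
import numpy as np
from scipy.special import xlogy   # xlogy(t, t) = t*log(t), handling 0*log(0) = 0

def f_divergence(p, q, f, dx, f_slope_at_inf=np.inf):
    """Approximate D_f(p, q) = integral of q(x) f(p(x)/q(x)) dx on a regular grid.

    Where q(x) == 0 but p(x) > 0, the usual convention is that the point
    contributes p(x) * lim_{t -> inf} f(t)/t; pass that limit as
    `f_slope_at_inf` (it is +inf for KL and 1/2 for total variation).
    The supplied f should itself handle the value t == 0 sensibly.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    terms = np.zeros_like(p)
    pos = q > 0
    terms[pos] = q[pos] * f(p[pos] / q[pos])
    zero_q = (~pos) & (p > 0)
    terms[zero_q] = p[zero_q] * f_slope_at_inf
    return np.sum(terms) * dx

# Sanity check: with f(t) = t log t we should recover ordinary KL.
print(f_divergence([0.2, 0.8], [0.5, 0.5], lambda t: xlogy(t, t), dx=1.0))   # ~0.193
```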

We recover KL by taking $f(x) = x \log x$. We get the squared Hellinger distance via $f(x) = (1 - \sqrt{x})^2$, and we get the total-variation distance (half the $L_1$ distance) by taking $f(x) = \frac{1}{2} |x - 1|$. The latter gives

$$ D_{\mathrm{TV}}(p, q) = \frac{1}{2} \int |p(x) - q(x)| \, dx $$

Note that this last one at least gives you a finite answer.
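
As a quick, self-contained numerical check (the red/blue densities below are my own stand-ins for the curves in the question, not taken from it), KL blows up on this kind of example while the squared Hellinger and total-variation distances stay finite:

```python
import numpy as np
from scipy.stats import norm

# Stand-ins for the question's curves: "red" has mass below 0.25, "blue" does not.
x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]
red = norm.pdf(x, loc=0.4, scale=0.15)
blue = np.where(x >= 0.25, norm.pdf(x, loc=0.5, scale=0.15), 0.0)
red /= red.sum() * dx      # renormalize on the grid
blue /= blue.sum() * dx

with np.errstate(divide="ignore"):
    kl = np.sum(np.where(blue > 0, red * np.log(red / blue), np.inf)) * dx
hellinger_sq = np.sum((np.sqrt(red) - np.sqrt(blue)) ** 2) * dx
tv = 0.5 * np.sum(np.abs(red - blue)) * dx

print(kl)            # inf
print(hellinger_sq)  # finite (at most 2)
print(tv)            # finite (at most 1)
```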

In another little book entitled Density Estimation: The $L_1$ View, Devroye argues strongly for the use of this latter distance due to its many nice invariance properties (among others). This latter book is probably a little harder to get a hold of than the former and, as the title suggests, a bit more specialized.


Addendum: Via this question, I became aware that the measure @Didier proposes appears to be (up to a constant) the Jensen-Shannon divergence. If you follow the link to the answer provided in that question, you'll see that the square root of this quantity is actually a metric and was previously recognized in the literature as a special case of an $f$-divergence. I found it interesting that we seem to have collectively "reinvented" the wheel (rather quickly) via the discussion of this question. The interpretation I gave in the comment below @Didier's response was also previously recognized. All around, kind of neat, actually.
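
If you want to play with it numerically, SciPy exposes the square root of the Jensen-Shannon divergence, i.e. the metric referred to above, as `scipy.spatial.distance.jensenshannon`. The toy distributions below are my own and chosen only so that ordinary KL would be infinite:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

red  = np.array([0.05, 0.45, 0.45, 0.05])
blue = np.array([0.00, 0.50, 0.45, 0.05])   # no mass on the first outcome

js_dist = jensenshannon(red, blue, base=2)   # sqrt of the JS divergence, a metric in [0, 1]
print(js_dist, js_dist ** 2)                 # both finite, unlike KL(red || blue)
```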
