Why the Kullback-Leibler Divergence is never negative

measure-theory, statistics

The Kullback-Leibler divergence for the continuous case, for two probability densities $p$ and $q$, is

$$D_\text{kl}\left(p,q\right) = \int_{x\in\chi}\left(p(x)\log(p(x)) - p(x)\log(q(x))\right)\text{d}x.$$

Why is it, then, that for any choice of $q$, $D_\text{kl}\left(p,q\right)\geq0$?
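
As a quick numerical sanity check (a sketch only: it assumes SciPy is available, and the two Gaussian densities are arbitrary choices for illustration), the integral can be evaluated directly:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Two example densities (arbitrary choices for illustration):
# p ~ N(0, 1), q ~ N(1, 2); both are strictly positive everywhere.
p = norm(0.0, 1.0).pdf
q = norm(1.0, 2.0).pdf

# Integrand p(x) * (log p(x) - log q(x)) from the definition above.
def integrand(x):
    return p(x) * (np.log(p(x)) - np.log(q(x)))

# Integrate over a finite range that carries essentially all the mass,
# avoiding underflow of the densities in the far tails.
dkl, _ = quad(integrand, -30.0, 30.0)
print(dkl)  # about 0.443 here, and nonnegative for any choice of q
```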


I will attempt a proof by contradiction. Suppose there exists a probability density $q$ for which $D_\text{kl}\left(p,q\right)<0$. An initial guess would be to require

$$\forall x\in\chi: q(x)>p(x).$$

But that cannot hold, since both $p$ and $q$ must integrate to one:

$$\int_{x\in\chi}q(x)\text{d}x>\int_{x\in\chi}p(x)\text{d}x = 1. \quad\unicode{x21af}$$

So $p$ and $q$ must cross at least once, meaning that we can split the integral into two parts: one over the region where $q\geq p$ and another over the region where $q<p$:

$$D_\text{kl}\left(p,q\right) = \int_{x\in\chi_{q\geq p}}\left(p(x)\log(p(x)) - p(x)\log(q(x))\right)\text{d}x + \int_{x\in\chi_{q< p}}\left(p(x)\log(p(x)) - p(x)\log(q(x))\right)\text{d}x.$$

In the second term, choose $q$ as close to $p$ as possible:

$$\forall x\in\chi_{q<p} : q(x) = p(x)-\epsilon(x).$$

In the first term we want $q\geq p$. But if we choose $q>p$ there, then

$$\int_{x\in\chi}q(x)\text{d}x = \int_{x\in\chi_{q\geq p}}q(x)\text{d}x + \int_{x\in\chi_{q<p}}q(x)\text{d}x>\int_{x\in\chi_{q\geq p}}p(x)\text{d}x + \int_{x\in\chi_{q<p}}p(x)\text{d}x - \int_{x\in\chi_{q<p}}\epsilon(x)\text{d}x$$
$$\int_{x\in\chi}q(x)\text{d}x>1- \int_{x\in\chi_{q<p}}\epsilon(x)\text{d}x.$$

We reach a contradiction in the limit $\epsilon(x)\to0$, since then $\int_{x\in\chi}q(x)\text{d}x>1$. So the only suitable choice for $q$ is $q=p$.
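
This limiting behaviour can also be illustrated numerically. The sketch below assumes SciPy, and realizes "$q$ arbitrarily close to $p$" through a mixture $q_\epsilon=(1-\epsilon)p+\epsilon r$ with an arbitrary density $r$, so that $q_\epsilon$ stays a valid density (a different construction from the pointwise $p(x)-\epsilon(x)$ above, chosen purely for illustration):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(0.0, 1.0).pdf
r = norm(2.0, 1.5).pdf  # arbitrary second density, chosen for illustration

def kl(pf, qf, lo=-30.0, hi=30.0):
    # p(x) * log(p(x)/q(x)) over a range carrying essentially all the mass.
    val, _ = quad(lambda x: pf(x) * np.log(pf(x) / qf(x)), lo, hi)
    return val

# Deform q toward p: the divergence stays nonnegative and shrinks to 0,
# consistent with q = p being the only density achieving D_kl = 0.
for eps in [0.5, 0.1, 0.01]:
    q = lambda x, e=eps: (1.0 - e) * p(x) + e * r(x)
    print(eps, kl(p, q))
```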

Best Answer

The most elementary proof uses the inequality $\log t\leq t -1$ for $t >0$, which can be verified by differentiation. Note that restricting the integration in the definition of $D_\text{kl}\left(p,q\right)$ to the set $\{x: p(x)>0\}$ does not affect the value of the integral. Therefore,

$$-D_\text{kl}\left(p,q\right) = \int_{p(x)>0} p(x)\log \frac{q(x)}{p(x)} \,\text{d}x $$ $$\leq \int_{p(x)>0} p(x) \Bigl(\frac{q(x)}{p(x)}-1 \Bigr) \,\text{d}x = \int_{p(x)>0} q(x) \,\text{d}x - \int_{p(x)>0} p(x) \,\text{d}x \leq 0\,.$$
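
As a small numerical spot check of the two bounds (a sketch assuming NumPy, with hypothetical example pmfs standing in for the densities):

```python
import numpy as np

# The key inequality log t <= t - 1 on a grid of positive t.
t = np.linspace(0.01, 10.0, 1000)
assert np.all(np.log(t) <= t - 1.0)

# Discrete analogue of the chain of bounds above:
# -D_kl(p, q) = sum p log(q/p) <= sum p (q/p - 1) = sum q - sum p = 0.
p = np.array([0.2, 0.5, 0.3])  # hypothetical pmf
q = np.array([0.4, 0.4, 0.2])  # hypothetical pmf
neg_dkl = np.sum(p * np.log(q / p))
bound = np.sum(p * (q / p - 1.0))  # equals sum(q) - sum(p) = 0
print(neg_dkl, bound)              # neg_dkl <= bound <= 0
```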

There are other proofs using convexity and Jensen's inequality.
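
For instance, since $\log$ is concave, Jensen's inequality gives (a sketch of that route, under the same restriction to the set $\{x: p(x)>0\}$):

$$-D_\text{kl}\left(p,q\right) = \int_{p(x)>0} p(x)\log \frac{q(x)}{p(x)} \,\text{d}x \leq \log \int_{p(x)>0} p(x)\,\frac{q(x)}{p(x)} \,\text{d}x = \log \int_{p(x)>0} q(x) \,\text{d}x \leq \log 1 = 0\,.$$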