Information Theory – Why KL Divergence is Non-Negative

information theory, kullback-leibler

Why is KL divergence non-negative?

From the perspective of information theory, I have the following intuitive understanding:

Say there are two ensembles $A$ and $B$ which are composed of the same set of elements labeled by $x$. $p(x)$ and $q(x)$ are different probability distributions over ensemble $A$ and $B$ respectively.

From the perspective of information theory, $-\log_{2}(p(x))$ is the least number of bits required to record an element $x$ of ensemble $A$. So the expectation
$$\sum_{x \in \text{ensemble}}-p(x)\ln(p(x))$$
can be interpreted as the minimum average code length (in nats, since the natural log is used) needed to record an element of $A$.

Since this formula puts a lower bound on the number of bits that we need on average, for a different ensemble $B$ with a different probability distribution $q(x)$, the code length that $q(x)$ assigns to each element $x$ will surely not beat the one given by $p(x)$. Taking the expectation,
$$\sum_{x\in \text{ensemble}}-p(x)\ln(q(x))$$
this average length will surely be greater than the former one, which leads to
$$\sum_{x\in \text{ensemble}}p(x)\ln\frac{p(x)}{q(x)} > 0$$
I don't put $\ge$ here since $p(x)$ and $q(x)$ are different.
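As a quick numerical sanity check on this intuition (a sketch I am adding for illustration, with made-up distributions), the cross-entropy $-\sum_x p(x)\ln q(x)$ should never be smaller than the entropy $-\sum_x p(x)\ln p(x)$:

```python
import math

def entropy(p):
    """Average code length -sum p ln p (in nats) under the true distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average code length -sum p ln q when coding samples from p with a code built for q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Two example distributions over the same four elements (made up for illustration).
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

h = entropy(p)
ce = cross_entropy(p, q)
print(f"H(p)     = {h:.4f} nats")
print(f"H(p, q)  = {ce:.4f} nats")
print(f"KL(p||q) = {ce - h:.4f} nats  (non-negative)")
```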

This is my intuitive understanding. Is there a purely mathematical way of proving that the KL divergence is non-negative? The problem can be stated as:

Given that $p(x)$ and $q(x)$ are both positive over the real line, with $\int_{-\infty}^{+\infty}p(x)\,dx = 1$ and $\int_{-\infty}^{+\infty}q(x)\,dx = 1$,
prove that $$\int_{-\infty}^{+\infty}p(x)\ln\frac{p(x)}{q(x)}\,dx$$ is non-negative.

How can this be proved? Or can this be proved without extra conditions?
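For a concrete continuous example (my own illustration, assuming scipy is available), one can compare the closed-form KL divergence between two Gaussians with a direct numerical evaluation of the integral above; both come out non-negative:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Two Gaussian densities p and q (parameters picked arbitrarily for illustration).
mu1, s1 = 0.0, 1.0
mu2, s2 = 1.0, 2.0
p = norm(mu1, s1).pdf
q = norm(mu2, s2).pdf

# Direct numerical evaluation of  integral of p(x) ln(p(x)/q(x)) dx
# over a range that contains essentially all of p's mass.
kl_numeric, _ = integrate.quad(lambda x: p(x) * np.log(p(x) / q(x)), -20, 20)

# Closed-form KL divergence between two univariate Gaussians, for comparison.
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(f"numeric:     {kl_numeric:.6f}")
print(f"closed form: {kl_closed:.6f}")  # both non-negative
```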

Best Answer

Proof 1:

First note that $\ln a \leq a-1$ for all $a \gt 0$.

We will now show that $-D_{KL}(p||q) \leq 0$, which means that $D_{KL}(p||q) \geq 0$.

\begin{align} -D(p||q)&=-\sum_x p(x)\ln \frac{p(x)}{q(x)}\\ &= \sum_x p(x)\ln \frac{q(x)}{p(x)}\\ &\stackrel{\text{(a)}}{\leq} \sum_x p(x)\left(\frac{q(x)}{p(x)}-1\right)\\ &=\sum_x q(x) - \sum_x p(x)\\ &= 1 - 1\\ &= 0 \end{align}

For inequality (a) we used the $\ln$ inequality explained in the beginning.
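Here is a small numerical sketch (my addition, with randomly drawn distributions) that checks both the $\ln a \leq a - 1$ bound and the cancellation used in the last steps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Check ln(a) <= a - 1 on a grid of positive values.
a = np.linspace(0.01, 10, 1000)
assert np.all(np.log(a) <= a - 1)

# Two random distributions p and q over the same five elements.
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

# The cancellation after step (a): sum_x p(x) * (q(x)/p(x) - 1) = sum q - sum p = 0.
print(np.sum(p * (q / p - 1)))    # ~0 up to floating-point error
# Hence -D(p||q) <= 0, i.e. D(p||q) >= 0.
print(np.sum(p * np.log(p / q)))  # non-negative
```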

Alternatively you can start with Gibbs' inequality which states: $$ -\sum_x p(x) \log_2 p(x) \leq -\sum_x p(x)\log_2 q(x) $$

Then if we bring the left term to the right we get: $$ \sum_x p(x) \log_2 p(x) - \sum_x p(x)\log_2 q(x)\geq 0 \\ \sum_x p(x)\log_2 \frac{p(x)}{q(x)}\geq 0 $$

The reason I am not counting this as a separate proof is that, if you were to ask me to prove Gibbs' inequality, I would have to start from the non-negativity of the KL divergence and do the same proof from the top.


Proof 2: We use the Log sum inequality: $$ \sum_{i=1}^{n} a_i \log_2 \frac{a_i}{b_i} \geq \left(\sum_{i=1}^{n} a_i\right)\log_2\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} $$

Then we can show that $D_{KL}(p||q) \geq 0$: \begin{align} D(p||q)&=\sum_x p(x)\log_2 \frac{p(x)}{q(x)}\\ &\stackrel{\text{(b)}}{\geq} \left(\sum_x p(x)\right)\log_2\frac{\sum_x p(x)}{\sum_x q(x)}\\ &=1 \cdot \log_2 \frac{1}{1}\\ &=0 \end{align}

where we have used the Log sum inequality at (b).
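For completeness, here is a quick numerical check of the Log sum inequality itself, using arbitrary positive numbers (an illustrative sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random(6) + 0.1   # arbitrary positive numbers
b = rng.random(6) + 0.1

lhs = np.sum(a * np.log2(a / b))
rhs = np.sum(a) * np.log2(np.sum(a) / np.sum(b))
print(lhs, rhs, lhs >= rhs)   # Log sum inequality: lhs >= rhs

# With a = p and b = q (each summing to 1) the right-hand side becomes
# 1 * log2(1/1) = 0, which is exactly step (b) in the proof.
```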


Proof 3:

(Taken from the book "Elements of Information Theory" by Thomas M. Cover and Joy A. Thomas)

\begin{align} -D(p||q)&=-\sum_x p(x)\log_2 \frac{p(x)}{q(x)}\\ &= \sum_x p(x)\log_2 \frac{q(x)}{p(x)}\\ &\stackrel{\text{(c)}}{\leq} \log_2 \sum_x p(x)\frac{q(x)}{p(x)}\\ &=\log_2 1\\ &=0 \end{align}

where at (c) we have used Jensen's inequality and the fact that $\log$ is a concave function.
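Step (c) can also be checked numerically: under any $p$, the average of $\log_2(q/p)$ is at most $\log_2$ of the average, which equals $\log_2 1 = 0$ (a small sketch with made-up distributions):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

ratio = q / p
lhs = np.sum(p * np.log2(ratio))   # E_p[ log2(q/p) ]  =  -D(p||q)
rhs = np.log2(np.sum(p * ratio))   # log2( E_p[q/p] )  =  log2(1) = 0
print(lhs, rhs)                    # Jensen's inequality: lhs <= rhs = 0
```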