Why is KL divergence non-negative?
From the perspective of information theory, I have the following intuitive understanding:
Say there are two ensembles $A$ and $B$ which are composed of the same set of elements labeled by $x$. $p(x)$ and $q(x)$ are different probability distributions over ensemble $A$ and $B$ respectively.
From the perspective of information theory, $-\log_{2}(p(x))$ is the least number of bits required to record an element $x$ of ensemble $A$. So the expectation
$$\sum_{x \in \text{ensemble}}-p(x)\ln(p(x))$$
can be interpreted as the minimum number of bits we need, on average, to record an element of $A$ (switching between $\log_2$ and $\ln$ only changes the unit).
Since this formula is a lower bound on the average number of bits we need, the code lengths derived from the different distribution $q(x)$ of ensemble $B$ will surely not beat those given by $p(x)$ for each element $x$, which means that, taking the expectation,
$$\sum_{x\in ensemble}-p(x)\ln(q(x))$$
this average length will surely be greater than the former one, which leads to
$$\sum_{x\in \text{ensemble}}p(x)\ln\frac{p(x)}{q(x)} > 0$$
I write $>$ rather than $\ge$ here since $p(x)$ and $q(x)$ are assumed to be different.
This is my intuitive understanding, is there a purely mathematical way of proving KL divergence is non-negative? The problem can be stated as:
Given that $p(x)$ and $q(x)$ are both positive on the real line, with $\int_{-\infty}^{+\infty}p(x)\,dx = 1$ and $\int_{-\infty}^{+\infty}q(x)\,dx = 1$,
prove that $$\int_{-\infty}^{+\infty}p(x)\ln\frac{p(x)}{q(x)}\,dx$$ is non-negative.
How can this be proved? Or can this be proved without extra conditions?
Best Answer
Proof 1:
First note that $\ln a \leq a-1$ for all $a \gt 0$.
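This bound is easy to sanity-check numerically; the following sketch (not part of the original answer, plain NumPy for illustration) verifies it on a grid of positive values:

```python
import numpy as np

# Numerically verify ln(a) <= a - 1 on a grid of positive values.
a = np.linspace(0.01, 10.0, 1000)
gap = (a - 1) - np.log(a)  # should be non-negative everywhere
assert np.all(gap >= -1e-12)

# Equality holds exactly at a = 1, since ln(1) = 0.
assert (1.0 - 1.0) - np.log(1.0) == 0.0
```

The gap $(a-1)-\ln a$ vanishes only at $a = 1$, which is why equality in the KL bound below requires $p(x) = q(x)$ for all $x$.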
We will now show that $-D_{KL}(p||q) \leq 0$, which means that $D_{KL}(p||q) \geq 0$:
\begin{align} -D(p||q)&=-\sum_x p(x)\ln \frac{p(x)}{q(x)}\\ &= \sum_x p(x)\ln \frac{q(x)}{p(x)}\\ &\stackrel{\text{(a)}}{\leq} \sum_x p(x)\left(\frac{q(x)}{p(x)}-1\right)\\ &=\sum_x q(x) - \sum_x p(x)\\ &= 1 - 1\\ &= 0 \end{align}
For inequality (a) we used the $\ln$ inequality explained in the beginning.
Alternatively you can start with Gibbs' inequality which states: $$ -\sum_x p(x) \log_2 p(x) \leq -\sum_x p(x)\log_2 q(x) $$
Then if we bring the left term to the right we get: $$ \sum_x p(x) \log_2 p(x) - \sum_x p(x)\log_2 q(x)\geq 0 \\ \sum_x p(x)\log_2 \frac{p(x)}{q(x)}\geq 0 $$
The reason I am not including this as a separate proof is that, if you were to ask me to prove Gibbs' inequality, I would have to start from the non-negativity of the KL divergence and repeat the proof above.
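As a sanity check (not part of the original argument), one can sample random discrete distributions and confirm that the computed divergence is never negative; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw random discrete distributions p and q on the same support
# and confirm that D_KL(p || q) >= 0 every time.
for _ in range(1000):
    p = rng.random(5) + 1e-3
    p /= p.sum()
    q = rng.random(5) + 1e-3
    q /= q.sum()
    kl = np.sum(p * np.log(p / q))
    assert kl >= -1e-12  # non-negative up to floating-point error
```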
Proof 2: We use the Log sum inequality: $$ \sum_{i=1}^{n} a_i \log_2 \frac{a_i}{b_i} \geq \left(\sum_{i=1}^{n} a_i\right)\log_2\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} $$
Then we can show that $D_{KL}(p||q) \geq 0$: \begin{align} D(p||q)&=\sum_x p(x)\log_2 \frac{p(x)}{q(x)}\\ &\stackrel{\text{(b)}}{\geq} \left(\sum_x p(x)\right)\log_2\frac{\sum_x p(x)}{\sum_x q(x)}\\ &=1 \cdot \log_2 \frac{1}{1}\\ &=0 \end{align}
where we have used the Log sum inequality at (b).
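The log sum inequality itself can be checked numerically in the same spirit (an illustrative sketch, not from the original answer):

```python
import numpy as np

rng = np.random.default_rng(1)

# Check the log sum inequality on random positive vectors a, b:
# sum_i a_i log2(a_i / b_i) >= (sum_i a_i) log2(sum_i a_i / sum_i b_i)
for _ in range(1000):
    a = rng.random(4) + 1e-3
    b = rng.random(4) + 1e-3
    lhs = np.sum(a * np.log2(a / b))
    rhs = a.sum() * np.log2(a.sum() / b.sum())
    assert lhs >= rhs - 1e-9
```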
Proof 3:
(Taken from the book "Elements of Information Theory" by Thomas M. Cover and Joy A. Thomas)
\begin{align} -D(p||q)&=-\sum_x p(x)\log_2 \frac{p(x)}{q(x)}\\ &= \sum_x p(x)\log_2 \frac{q(x)}{p(x)}\\ &\stackrel{\text{(c)}}{\leq} \log_2 \sum_x p(x)\frac{q(x)}{p(x)}\\ &=\log_2 1\\ &=0 \end{align}
where at (c) we have used Jensen's inequality and the fact that $\log$ is a concave function.
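Jensen's inequality for the concave $\log$, i.e. $\mathbb{E}[\log_2 X] \leq \log_2 \mathbb{E}[X]$, which step (c) relies on, can likewise be spot-checked numerically (illustrative sketch, not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(2)

# Spot-check Jensen's inequality for the concave log2:
# E[log2(X)] <= log2(E[X]) for any distribution p and positive values x.
for _ in range(1000):
    p = rng.random(6)
    p /= p.sum()
    x = rng.random(6) + 1e-3
    assert np.sum(p * np.log2(x)) <= np.log2(np.sum(p * x)) + 1e-12
```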