Kullback-Leibler divergence nonnegative proof

entropy, expected value, probability, proof-explanation

I am given an example and proof of the Kullback-Leibler divergence:

Let $\mathbf{p} = (p_1, \dots, p_n)$ and $\mathbf{r} = (r_1, \dots, r_n)$ be probability vectors (so each is nonnegative and sums to $1$). Think of each as a possible PMF for a random variable whose support consists of $n$ distinct values. The Kullback-Leibler divergence between $\mathbf{p}$ and $\mathbf{r}$ is defined as

$$D(\mathbf{p}, \mathbf{r}) = \sum_{j = 1}^n p_j \log_2 (1/r_j) - \sum_{j = 1}^n p_j \log_2 (1/p_j).$$

This is the difference between the average surprise we will experience when the actual probabilities are $\mathbf{p}$ but we are instead working with $\mathbf{r}$ (for example, if $\mathbf{p}$ is unknown and $\mathbf{r}$ is our current guess for $\mathbf{p}$), and our average surprise when we work with $\mathbf{p}$. Show that the Kullback-Leibler divergence is nonnegative.
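For concreteness, here is a minimal numerical sketch of the definition above (not from the book); the vectors `p` and `r` are made-up examples, and strictly positive entries are assumed so that all the logs are finite:

```python
import numpy as np

def kl_divergence(p, r):
    """D(p, r) = sum_j p_j log2(1/r_j) - sum_j p_j log2(1/p_j).
    Assumes p and r are probability vectors with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    r = np.asarray(r, dtype=float)
    return np.sum(p * np.log2(1.0 / r)) - np.sum(p * np.log2(1.0 / p))

# Illustrative probability vectors (each nonnegative, summing to 1)
p = [0.5, 0.3, 0.2]
r = [0.4, 0.4, 0.2]

print(kl_divergence(p, r))  # a small positive number
print(kl_divergence(p, p))  # 0.0, since p = r
```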

Solution:

Using properties of logs, we have

$$D(\mathbf{p}, \mathbf{r}) = - \sum_{j = 1}^n p_j \log_2 \left( \dfrac{r_j}{p_j} \right).$$

Let $Y$ be a random variable that takes on values $r_j/p_j$ with probabilities $p_j$, so that $D(\mathbf{p}, \mathbf{r}) = -E(\log_2(Y))$ by LOTUS. Then by Jensen's inequality,

$$D(\mathbf{p}, \mathbf{r}) = -E(\log_2(Y)) \ge - \log_2(E(Y)) = -\log_2(1) = 0,$$

with equality if and only if $\mathbf{p} = \mathbf{r}$. This result tells us that we're more surprised on average when we work with the wrong probabilities than when we work with the correct probabilities.
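As a quick numerical sanity check on this Jensen's-inequality argument (a sketch assuming strictly positive entries, not part of the book's solution), one can draw random probability vectors and confirm that $D(\mathbf{p}, \mathbf{r}) \ge 0$, with $D(\mathbf{p}, \mathbf{p}) = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_divergence(p, r):
    # D(p, r) = -sum_j p_j * log2(r_j / p_j); assumes strictly positive entries
    return -np.sum(p * np.log2(r / p))

for _ in range(1000):
    # Draw two random probability vectors of length 5
    p = rng.random(5); p /= p.sum()
    r = rng.random(5); r /= r.sum()
    assert kl_divergence(p, r) >= -1e-12     # nonnegative (up to rounding)
    assert abs(kl_divergence(p, p)) < 1e-12  # zero when p = r
print("all checks passed")
```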

I was wondering how the authors got $E(Y) = 1$ in the step $- \log_2(E(Y)) = -\log_2(1)$. Thank you.

Best Answer

By LOTUS, $E(Y) = \sum_{j=1}^n p_j \cdot \dfrac{r_j}{p_j} = \sum_{j=1}^n r_j = 1$, since the $p_j$'s cancel and $\mathbf{r}$ is a probability vector.
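In code, the same cancellation can be checked directly (again with made-up example vectors):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # illustrative "true" PMF
r = np.array([0.4, 0.4, 0.2])  # illustrative "guessed" PMF

# E(Y) = sum_j p_j * (r_j / p_j) = sum_j r_j = 1
print(np.sum(p * (r / p)))  # 1.0, up to floating-point rounding
```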