Solved – Smallest Kullback-Leibler divergence

information theory, kullback-leibler, probability

Suppose we seek to approximate an arbitrary distribution $p_1(x)$ by a normal
density $p_2(x) = \mathcal N(x \mid \mu, \Sigma)$. How can I show that the values of $\mu$ and $\Sigma$ that lead to the smallest Kullback–Leibler
divergence are
$$
\mu = \mu_1 = \mathbb E_1[X]
$$
and
$$
\Sigma = \Sigma_1 = \mathbb E_1[(X - \mu_1)(X - \mu_1)^T],
$$
where the notation $\mathbb E_1[\cdot]$ indicates that the expectation is taken with respect to the density $p_1(x)$?

For reference, the definition of the Kullback–Leibler divergence is
$$
D(p_1 \,\|\, p_2) = \int p_1 \log(p_1/p_2) \,\text{d}\lambda \,.
$$
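For concreteness, here is a minimal numerical sketch of this definition in one dimension. The Student-$t$ choice for $p_1$ and all parameter values are purely illustrative, not part of the question:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# p1 is a Student-t(5) density; p2 is the normal N(mu, sigma^2) we fit.
p1 = stats.t(df=5)

def kl_p1_to_normal(mu, sigma):
    """D(p1 || p2) = integral of p1 * log(p1/p2), by quadrature."""
    p2 = stats.norm(loc=mu, scale=sigma)
    # Work on the log scale so the ratio p1/p2 cannot underflow.
    integrand = lambda x: np.exp(p1.logpdf(x)) * (p1.logpdf(x) - p2.logpdf(x))
    val, _ = quad(integrand, -40, 40)  # t(5) mass beyond +/-40 is negligible
    return val

# The moment-matched normal (mean 0, variance 5/3 for a t(5)) attains a
# smaller divergence than a mismatched one, as the claim predicts.
print(kl_p1_to_normal(mu=0.0, sigma=1.0))
print(kl_p1_to_normal(mu=0.0, sigma=np.sqrt(5 / 3)))
```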

Best Answer

If you express the Kullback–Leibler divergence when $p_2$ is a normal pdf on $\mathbb R^d$,
$$\begin{align}
D(p_1 \| p_2) &= \int_{\mathbb R^d} p_1 \log p_1 \,\text{d}\lambda - \int_{\mathbb R^d} p_1 \log p_2 \,\text{d}\lambda \\
&= \int_{\mathbb R^d} p_1 \log p_1 \,\text{d}\lambda - \dfrac{1}{2} \int_{\mathbb R^d} p_1 \left\{ -(x-\mu)^T \Sigma^{-1} (x-\mu) - \log |\Sigma| - d \log 2\pi \right\} \text{d}\lambda \\
&= \int_{\mathbb R^d} p_1 \log p_1 \,\text{d}\lambda + \dfrac{1}{2} \left\{ \log |\Sigma| + d \log 2\pi + \mathbb{E}_1 \left[ (x-\mu)^T \Sigma^{-1} (x-\mu) \right] \right\}.
\end{align}$$
Now
$$\mathbb{E}_1 \left[ (x-\mu)^T \Sigma^{-1} (x-\mu) \right] = \mathbb{E}_1 \left[ (x-\mathbb{E}_1[x])^T \Sigma^{-1} (x-\mathbb{E}_1[x]) \right] + (\mathbb{E}_1[x]-\mu)^T \Sigma^{-1} (\mathbb{E}_1[x]-\mu),$$
since the cross term vanishes ($\mathbb{E}_1[x - \mathbb{E}_1[x]] = 0$). Because $\Sigma^{-1}$ is positive definite, the second term is nonnegative and equals zero exactly when $\mu = \mathbb{E}_1[x]$, so the minimum in $\mu$ is indeed reached for $\mu = \mathbb{E}_1[x]$.
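This decomposition can be sanity-checked by Monte Carlo. In the sketch below, $p_1$ is a two-component Gaussian mixture and $\mu$, $\Sigma$ are arbitrary trial parameters; all of these choices are illustrative stand-ins, not from the question:

```python
import numpy as np

# Draw from an arbitrary p1 (a two-component Gaussian mixture, chosen only
# for illustration) and compare both sides of the decomposition.
rng = np.random.default_rng(0)
n = 200_000
comp = rng.integers(0, 2, size=n)[:, None]
x = np.where(comp == 0,
             rng.normal([-2.0, 0.0], 1.0, size=(n, 2)),
             rng.normal([3.0, 1.0], 0.5, size=(n, 2)))

mu = np.array([0.5, -0.3])                          # arbitrary trial mean
Sigma_inv = np.linalg.inv(np.array([[2.0, 0.3],     # arbitrary SPD Sigma
                                    [0.3, 1.0]]))
m = x.mean(axis=0)                                  # estimate of E_1[x]

quad_form = lambda v: np.einsum('ni,ij,nj->n', v, Sigma_inv, v)
lhs = quad_form(x - mu).mean()
rhs = quad_form(x - m).mean() + (m - mu) @ Sigma_inv @ (m - mu)
print(lhs, rhs)  # identical up to floating-point error
```

Note that the two sides agree essentially exactly here, because the decomposition is an algebraic identity that also holds for the empirical distribution of the samples.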

Minimising in $\Sigma$,
$$\begin{align}
\log |\Sigma| + \mathbb{E}_1 \left[ (x-\mathbb{E}_1[x])^T \Sigma^{-1} (x-\mathbb{E}_1[x]) \right]
&= \log |\Sigma| + \mathbb{E}_1 \left[ \text{trace} \left\{ (x-\mathbb{E}_1[x])^T \Sigma^{-1} (x-\mathbb{E}_1[x]) \right\} \right] \\
&= \log |\Sigma| + \mathbb{E}_1 \left[ \text{trace} \left\{ \Sigma^{-1} (x-\mathbb{E}_1[x]) (x-\mathbb{E}_1[x])^T \right\} \right] \\
&= \log |\Sigma| + \text{trace} \left\{ \Sigma^{-1} \, \mathbb{E}_1 \left[ (x-\mathbb{E}_1[x]) (x-\mathbb{E}_1[x])^T \right] \right\} \\
&= \log |\Sigma| + \text{trace} \left\{ \Sigma^{-1} \Sigma_1 \right\},
\end{align}$$
where the first equality writes the scalar quadratic form as its own trace and the second uses the cyclic property of the trace. Setting the gradient with respect to $\Sigma$ to zero, $\Sigma^{-1} - \Sigma^{-1} \Sigma_1 \Sigma^{-1} = 0$, leads to a minimum in $\Sigma$ for $\Sigma = \Sigma_1$.
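As a quick numerical spot-check of this last step, the sketch below compares $g(\Sigma) = \log|\Sigma| + \text{trace}\{\Sigma^{-1}\Sigma_1\}$ at $\Sigma = \Sigma_1$ against random positive-definite competitors; the dimension and random matrices are illustrative choices:

```python
import numpy as np

# g(Sigma) = log|Sigma| + trace(Sigma^{-1} Sigma_1), minimised over SPD
# matrices; the claim is that the minimiser is Sigma = Sigma_1.
rng = np.random.default_rng(1)
d = 3

def g(Sigma, Sigma1):
    _, logdet = np.linalg.slogdet(Sigma)
    return logdet + np.trace(np.linalg.solve(Sigma, Sigma1))

A = rng.normal(size=(d, d))
Sigma1 = A @ A.T + np.eye(d)            # a random SPD "target" covariance
base = g(Sigma1, Sigma1)                # = log|Sigma_1| + d at the claimed minimum

for _ in range(1000):
    C = rng.normal(size=(d, d))
    Sigma = C @ C.T + 1e-3 * np.eye(d)  # a random SPD competitor
    assert g(Sigma, Sigma1) >= base - 1e-9
print("no SPD competitor beat Sigma = Sigma_1; g(Sigma_1) =", base)
```

Writing $\lambda_i$ for the eigenvalues of $\Sigma^{-1}\Sigma_1$, the objective equals $\log|\Sigma_1| + \sum_i (\lambda_i - \log \lambda_i)$, and each summand is minimised at $\lambda_i = 1$, which explains why the check above never finds a better competitor.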