Analytically compute KL divergence of two Gaussian distributions

Tags: calculus, entropy, integration, probability

Consider two multivariate Gaussian distributions, $p(x)=\mathcal N(x;\mu_p, \sigma_p^2)$ and $q(x)=\mathcal N(x; \mu_q, \sigma_q^2)$. It seems that the KL divergence of these two Gaussian distributions, $D_{KL}(p(x)\Vert q(x))$, can be calculated analytically (according to the paper "Auto-Encoding Variational Bayes"), but I don't know how. The paper gives the result for the more special case where $q(x)=\mathcal N(x;\mathbf 0, \mathbf I)$, but no detailed procedure is provided.

P.S. I know how the univariate case is derived:

$$
\begin{align}
D_{KL}(p(x)\Vert q(x))&=\int p(x)\log {p(x)\over q(x)}dx\\
&=\int p(x)\log {{1\over \sqrt{2\pi \sigma_p^2}}\exp\left(-{(x-\mu_p)^2\over 2\sigma_p^2}\right)\over{1\over \sqrt{2\pi \sigma_q^2}}\exp\left(-{(x-\mu_q)^2\over 2\sigma_q^2}\right)}dx\\
&={1\over 2}\int\log{\sigma_q^2\over \sigma_p^2}p(x)dx - {1\over 2\sigma_p^2}\int(x-\mu_p)^2p(x)dx+{1\over2\sigma_q^2}\int(x-\mu_q)^2p(x)dx\\
&={1\over 2}\left(\log{\sigma_q^2\over \sigma_p^2} - 1 + {1\over\sigma_q^2}\int(x-\mu_p + \mu_p - \mu_q)^2p(x)dx\right)\\
&={1\over 2}\left(\log{\sigma_q^2\over \sigma_p^2} - 1 + {1\over\sigma_q^2}\int\left((x-\mu_p)^2 + 2(x-\mu_p)(\mu_p-\mu_q) + (\mu_p - \mu_q)^2\right)p(x)dx\right)\\
&={1\over 2}\left(\log{\sigma_q^2\over \sigma_p^2} - 1 + {1\over\sigma_q^2}\left(\sigma_p^2 + (\mu_p - \mu_q)^2\right)\right)\\
\end{align}
$$
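(In the second-to-last step, the cross term $2(\mu_p-\mu_q)\int(x-\mu_p)p(x)dx$ vanishes because $\int(x-\mu_p)p(x)dx=0$.)

As a quick sanity check of the closed form above, here is a minimal Python sketch (the function name and parameter values are my own) that compares it against direct numerical integration of $\int p(x)\log{p(x)\over q(x)}dx$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_univariate(mu_p, sigma_p, mu_q, sigma_q):
    # Closed form derived above:
    # 0.5 * (log(s_q^2 / s_p^2) - 1 + (s_p^2 + (m_p - m_q)^2) / s_q^2)
    return 0.5 * (np.log(sigma_q**2 / sigma_p**2) - 1
                  + (sigma_p**2 + (mu_p - mu_q)**2) / sigma_q**2)

mu_p, sigma_p, mu_q, sigma_q = 1.0, 2.0, -0.5, 1.5  # arbitrary example values

# Integrate p(x) * log(p(x)/q(x)) directly over a wide interval.
integrand = lambda x: norm.pdf(x, mu_p, sigma_p) * (
    norm.logpdf(x, mu_p, sigma_p) - norm.logpdf(x, mu_q, sigma_q))
numeric, _ = quad(integrand, -50, 50)

print(kl_univariate(mu_p, sigma_p, mu_q, sigma_q), numeric)  # the two should agree
```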

But I'm not very familiar with the multivariate case. It seems the result involves a sum over every dimension:

$$
D_{KL}(p(x)\Vert q(x))={1\over 2}\sum_{j\in J}\left(\log{\sigma_{q,j}^2\over \sigma_{p,j}^2} - 1 + {1\over\sigma_{q,j}^2}\left(\sigma_{p,j}^2 + (\mu_{p,j} - \mu_{q,j})^2\right)\right)
$$

where $j$ ranges over the dimensions of $x$. But why do we sum over all the dimensions?
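To make the summation concrete, here is a minimal sketch of that formula (the array arguments are hypothetical names of mine):

```python
import numpy as np

def kl_diagonal(mu_p, var_p, mu_q, var_q):
    # mu_p, var_p, mu_q, var_q: length-J arrays of per-dimension means and variances.
    # Each dimension contributes one univariate KL term; the total is their sum.
    per_dim = 0.5 * (np.log(var_q / var_p) - 1 + (var_p + (mu_p - mu_q)**2) / var_q)
    return per_dim.sum()
```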

Please help me sort this out. Thanks in advance!

Best Answer

According to http://101.110.118.57/stanford.edu/~jduchi/projects/general_notes.pdf, the KL divergence between two multivariate Gaussians on $\mathbb R^n$ is computed as follows:

$$
\begin{align}
D_{KL}(P_1\Vert P_2) &= {1\over 2}E_{P_1}\left[\log{\det\Sigma_2\over \det\Sigma_1}-(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)+(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right]\\
&={1\over 2}\left(\log{\det\Sigma_2\over \det\Sigma_1}+E_{P_1}\left[-\operatorname{tr}\big((x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\big)+\operatorname{tr}\big((x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\big)\right]\right)\\
&={1\over 2}\left(\log{\det\Sigma_2\over \det\Sigma_1}+E_{P_1}\left[-\operatorname{tr}\big(\Sigma_1^{-1}(x-\mu_1)(x-\mu_1)^T\big)+\operatorname{tr}\big(\Sigma_2^{-1}(x-\mu_2)(x-\mu_2)^T\big)\right]\right)\\
&={1\over 2}\left(\log{\det\Sigma_2\over \det\Sigma_1}-n+E_{P_1}\left[\operatorname{tr}\Big(\Sigma_2^{-1}\big(xx^T-2x\mu_2^T+\mu_2\mu_2^T\big)\Big)\right]\right)\\
&={1\over 2}\left(\log{\det\Sigma_2\over \det\Sigma_1}-n+\operatorname{tr}\Big(\Sigma_2^{-1}\big(\Sigma_1+\mu_1\mu_1^T-2\mu_1\mu_2^T+\mu_2\mu_2^T\big)\Big)\right)\\
&={1\over 2}\left(\log{\det\Sigma_2\over \det\Sigma_1}-n+\operatorname{tr}(\Sigma_2^{-1}\Sigma_1)+(\mu_1-\mu_2)^T\Sigma_2^{-1}(\mu_1-\mu_2)\right)
\end{align}
$$

Here the second step uses the fact that any scalar equals its own trace, $a=\operatorname{tr}(a)$; the third step uses the cyclic property $\operatorname{tr}(AB)=\operatorname{tr}(BA)$; the fourth step uses $E_{P_1}\left[(x-\mu_1)(x-\mu_1)^T\right]=\Sigma_1$, so that $E_{P_1}\left[\operatorname{tr}\big(\Sigma_1^{-1}(x-\mu_1)(x-\mu_1)^T\big)\right]=\operatorname{tr}(I_n)=n$; the fifth step uses $E_{P_1}\left[xx^T\right]=\Sigma_1+\mu_1\mu_1^T$ and $E_{P_1}[x]=\mu_1$; and the last step collects the mean terms via $\operatorname{tr}\big(\Sigma_2^{-1}(\mu_1-\mu_2)(\mu_1-\mu_2)^T\big)=(\mu_1-\mu_2)^T\Sigma_2^{-1}(\mu_1-\mu_2)$, using the symmetry of $\Sigma_2^{-1}$.

The last equation is equal to the one in the question when the $\Sigma$s are diagonal matrices: the determinants become products of the per-dimension variances (so the log of their ratio becomes a sum of logs), $\operatorname{tr}(\Sigma_2^{-1}\Sigma_1)=\sum_j \sigma_{1,j}^2/\sigma_{2,j}^2$, and the quadratic form becomes $\sum_j(\mu_{1,j}-\mu_{2,j})^2/\sigma_{2,j}^2$. That is why the multivariate result is a sum over the individual dimensions.
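For completeness, here is a small numerical sketch of the last equation (my own function names, not from the linked notes), checking that with diagonal covariances it reduces to the per-dimension sum in the question:

```python
import numpy as np

def kl_mvn(mu1, Sigma1, mu2, Sigma2):
    # KL(P1 || P2) per the last equation:
    # 0.5 * (log det(S2)/det(S1) - n + tr(S2^{-1} S1) + (m1-m2)^T S2^{-1} (m1-m2))
    n = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu1 - mu2
    return 0.5 * (np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)) - n
                  + np.trace(Sigma2_inv @ Sigma1)
                  + diff @ Sigma2_inv @ diff)

rng = np.random.default_rng(0)
mu1, mu2 = rng.normal(size=3), rng.normal(size=3)
var1, var2 = rng.uniform(0.5, 2.0, size=3), rng.uniform(0.5, 2.0, size=3)

general = kl_mvn(mu1, np.diag(var1), mu2, np.diag(var2))
per_dim = 0.5 * np.sum(np.log(var2 / var1) - 1 + (var1 + (mu1 - mu2)**2) / var2)
print(general, per_dim)  # the two values should match
```

In practice one would use `np.linalg.slogdet` (or Cholesky factors) rather than `det` for numerical stability, but the direct form above mirrors the derivation most closely.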