Machine Learning – Minimizing Kullback–Leibler Divergence Using the Hessian and Mathematical Statistics

distributions, inference, machine learning, mathematical-statistics, probability

Consider two continuous probability distributions $q(x)$ and $p(x)$. The Kullback–Leibler divergence measures the information lost when $q(x)$ is used to approximate $p(x)$:
\begin{equation*}
\mathrm{KL} (p(x)|| q(x)) = \int_{\mathbb{R}} p(x) \ln \frac{p(x)}{q(x)} dx
\end{equation*}

A distribution in the exponential family can be written as
\begin{equation*}
p_\theta(x) = \frac{1}{Z(\theta)} \exp(\theta^T \phi (x))
\end{equation*}

where $\phi(x)$ is the vector of the natural statistics of $x$ and $Z(\theta) = \int \exp(\theta^T \phi(x)) dx$.

My goal is to prove the theorem below for the special case of the normal-gamma distribution
$$(X,T) \sim \mathrm{NormalGamma}(\mu, \lambda,\alpha,\beta)$$
with pdf:
$$
f(x,t;\mu,\lambda,\alpha,\beta) = \frac{\beta^\alpha \sqrt{\lambda}}{\Gamma(\alpha)\sqrt{2\pi}} t^{\alpha-\frac{1}{2}} e^{-\beta t} \exp\left(-\frac{\lambda t(x-\mu)^2}{2} \right).
$$

The theorem states that the distribution $p_{\theta^*}$ which minimises the Kullback–Leibler divergence $\mathrm{KL}(p \,\|\, p_\theta)$ over the normal-gamma family $p_\theta$ is characterised by
\begin{equation*}
E_{p_{\theta^*}(x)}[\phi(x)] = E_{p(x)}[\phi(x)].
\end{equation*}
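For reference, here is my own expansion of the exponent of this pdf into the exponential-family form above (please correct me if I got the bookkeeping wrong):
\begin{align*}
\log f(x,t) &= \left(\alpha-\tfrac{1}{2}\right)\log t-\left(\beta+\tfrac{\lambda\mu^2}{2}\right)t+\lambda\mu\, tx-\tfrac{\lambda}{2}\, tx^2-\log Z(\theta),\\
\log Z(\theta) &= \log\Gamma(\alpha)-\alpha\log\beta-\tfrac{1}{2}\log\lambda+\tfrac{1}{2}\log(2\pi),
\end{align*}
so the natural statistics are $\phi(x,t) = (\log t,\ t,\ tx,\ tx^2)^\top$ and the natural parameters are $\theta = \left(\alpha-\tfrac{1}{2},\ -\beta-\tfrac{\lambda\mu^2}{2},\ \lambda\mu,\ -\tfrac{\lambda}{2}\right)^\top$.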

Proving this amounts to finding $\theta^*$ such that $\nabla_\theta f(\theta^*) = 0$, where $f(\theta)$ denotes the KL divergence viewed as a function of $\theta$.
I then need to compute the Hessian to show that $\theta^*$ is indeed a minimum.
My problem is that I have not been able to carry out these derivations.

Best Answer

To add some clarity to the original paper you cited in the previous post and to keep the notation consistent, I'll write the true density as $p(x)$ and the approximating density as $q_\theta(x)$, parameterized by $\theta$. Since you haven't specified the true density (the one being approximated), I'll try to (re-)explain what each step of the proof means.

The goal is to show that as long as the approximating density $q_\theta(x)$ belongs to an exponential family, minimizing the Kullback–Leibler (KL) divergence $\mathrm{KL}(p\| q_\theta)$ only requires matching the sufficient statistics. First, look at the definition of the KL divergence:
\begin{align}
\mathrm{KL}(p\| q_\theta) &= \int\log \frac{p(x)}{q_\theta(x)}\, p(x)\, dx \\
&= \mathrm{E}_{p(x)}\left(\log \frac{p(x)}{q_\theta(x)} \right) \\
&= \mathrm{E}_{p(x)}(\log p(x)) - \mathrm{E}_{p(x)}(\log q_\theta(x)).
\end{align}
Since we need to minimize this as a function of the parameters $\theta$, we make this explicit by writing $f(\theta) = \mathrm{KL}(p\| q_\theta)$, and use the first-order condition $\nabla_\theta f(\theta) = 0$. The term $\mathrm{E}_{p(x)}(\log p(x))$ in the KL divergence disappears upon differentiation because it is not a function of $\theta$. Therefore,
$$
\nabla_\theta f(\theta) = -\nabla_\theta \mathrm{E}_{p(x)}(\log q_\theta(x)).
$$
Now, go back to the definition of an exponential family, $q_\theta(x) = h(x) \exp\{\theta^\top \phi(x) - A(\theta) \}$, where $A(\theta)$ is the log-normalizing constant. Taking the logarithm of this density yields
\begin{align}
\log q_\theta(x) &= \log h(x) + \theta^\top \phi(x) - A(\theta),\\
\mathrm{E}_{p(x)}(\log q_\theta(x)) &= \mathrm{E}_{p(x)}(\log h(x)) + \theta^\top \mathrm{E}_{p(x)}(\phi(x)) - A(\theta),\\
\nabla_\theta \mathrm{E}_{p(x)}(\log q_\theta(x)) &= \mathrm{E}_{p(x)}(\phi(x)) - \nabla_\theta A(\theta).
\end{align}
But as I've derived in this answer, $\nabla_\theta A(\theta) = \mathrm{E}_{q_\theta(x)}(\phi(x))$. Therefore, the first-order condition $\nabla_\theta f(\theta) = 0$ gives us $\mathrm{E}_{p(x)}(\phi(x)) = \mathrm{E}_{q_\theta(x)}(\phi(x))$.
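Here is a small numerical illustration of the moment-matching result (my own toy example, not from the paper you cited): take $q_\theta$ to be a Gamma density, whose sufficient statistics are $\phi(x) = (\log x,\, x)$, and let $p$ be a lognormal. Minimizing $-\mathrm{E}_{p(x)}(\log q_\theta(x))$ numerically (the $\mathrm{E}_{p(x)}(\log p(x))$ term is constant in $\theta$) should return Gamma parameters whose moments $\mathrm{E}_{q_\theta}[\log X]$ and $\mathrm{E}_{q_\theta}[X]$ match those under $p$; all distribution choices and variable names below are mine.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, digamma

sigma = 0.5                      # p(x) is lognormal(0, sigma^2), an arbitrary choice
E_p_x = np.exp(sigma**2 / 2)     # E_p[X] for this lognormal
E_p_logx = 0.0                   # E_p[log X] for this lognormal

def neg_expected_log_q(params):
    # -E_p[log q_theta(x)] for a Gamma(a, b) density; minimizing this in theta
    # is equivalent to minimizing KL(p || q_theta).
    log_a, log_b = params        # optimize over logs to keep a, b positive
    a, b = np.exp(log_a), np.exp(log_b)
    return -(a * np.log(b) - gammaln(a) + (a - 1) * E_p_logx - b * E_p_x)

res = minimize(neg_expected_log_q, x0=[0.0, 0.0])
a_star, b_star = np.exp(res.x)

# Moment matching: for Gamma(a, b), E_q[X] = a/b and E_q[log X] = digamma(a) - log(b).
print("E_q[X]     =", a_star / b_star, "  vs  E_p[X]     =", E_p_x)
print("E_q[log X] =", digamma(a_star) - np.log(b_star), "  vs  E_p[log X] =", E_p_logx)
```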

To verify that solving the first-order condition for $\theta$ gives us the minimizer, we must compute the Hessian matrix and check that it is positive-definite. From $\nabla_\theta \mathrm{E}_{p(x)}(\log q_\theta(x)) = \mathrm{E}_{p(x)}(\phi(x)) - \nabla_\theta A(\theta)$, it is easily seen that $\nabla_\theta^2 f(\theta) = \nabla_\theta^2 A(\theta)$, which is the covariance matrix of the sufficient statistics. I will not prove this because it is a standard result in mathematical statistics, but refer to lecture notes like this if you're interested. Again, $[\nabla_\theta^2 A(\theta)]_{ij} = \mathrm{Cov}(\phi_i(x),\phi_j(x))$. A covariance matrix is always positive semi-definite, and it is positive-definite whenever the sufficient statistics are not affinely dependent (i.e., the exponential family is minimal, as it is here); therefore the first-order condition, when solved for $\theta$, indeed produces the minimizer.
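As a quick sanity check on the identity $\nabla_\theta^2 A(\theta) = \mathrm{Cov}(\phi(x))$ (again my own toy example, using a two-parameter Gamma family rather than the normal-gamma; the parameter values are arbitrary), the analytic Hessian of the Gamma log-normalizer should match the Monte Carlo covariance of $\phi(x) = (\log x,\, x)$ and have positive eigenvalues:

```python
import numpy as np
from scipy.special import polygamma

a, b = 3.0, 2.0                                  # arbitrary shape/rate for the toy Gamma family
rng = np.random.default_rng(0)
x = rng.gamma(shape=a, scale=1.0 / b, size=1_000_000)
phi = np.column_stack([np.log(x), x])            # sufficient statistics (log x, x)

# Analytic Hessian of the Gamma log-normalizer
# A(theta) = log Gamma(theta1 + 1) - (theta1 + 1) log(-theta2), with theta1 = a - 1, theta2 = -b:
hess_A = np.array([[polygamma(1, a), 1.0 / b],
                   [1.0 / b,         a / b**2]])

print("Hessian of A(theta):\n", hess_A)
print("Sample Cov(phi):\n", np.cov(phi, rowvar=False))
print("Eigenvalues (all positive => positive-definite):", np.linalg.eigvalsh(hess_A))
```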

Now, since you've asked how this plays out for the normal-gamma distribution, we've already established that the sufficient statistics are $\phi_1 = \log T$, $\phi_2 = T$, $\phi_3 = TX$, and $\phi_4 = TX^2$. To obtain the covariance matrix in full, you should compute the 10 distinct second-order partial derivatives $\frac{\partial^2 A(\theta)}{\partial\theta_i \partial\theta_j}$ for $1 \le i \le j \le 4$ (the Hessian is symmetric) of $$ A(\theta)= -\left(\theta_1+\frac{1}{2} \right)\log\left(\frac{\theta_3^2}{4\theta_4} - \theta_2 \right) -\frac{1}{2}\log(-2\theta_4) +\log\Gamma\left(\theta_1+\frac{1}{2}\right) + \frac{1}{2}\log(2\pi), $$ for which I seriously recommend using a computer algebra system such as Wolfram Alpha.
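If you would rather stay in Python than use Wolfram Alpha, here is a minimal SymPy sketch (my own; it simply takes the expression for $A(\theta)$ above verbatim) that produces the full $4\times 4$ Hessian symbolically:

```python
import sympy as sp

# Natural parameters theta_1, ..., theta_4 of the normal-gamma family
t1, t2, t3, t4 = sp.symbols('theta1:5')

# Log-normalizer A(theta) as given above
A = (-(t1 + sp.Rational(1, 2)) * sp.log(t3**2 / (4 * t4) - t2)
     - sp.Rational(1, 2) * sp.log(-2 * t4)
     + sp.loggamma(t1 + sp.Rational(1, 2))
     + sp.Rational(1, 2) * sp.log(2 * sp.pi))

# 4x4 symmetric matrix of second partial derivatives = Cov(phi(X, T))
H = sp.hessian(A, (t1, t2, t3, t4))
sp.pprint(sp.simplify(H))
```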
