Given two continuous probability distributions $p(x)$ and $q(x)$, the Kullback–Leibler divergence measures the information lost when $q(x)$ is used to approximate $p(x)$:

\begin{equation*}

\mathrm{KL} (p(x)|| q(x)) = \int_{\mathbb{R}} p(x) \ln \frac{p(x)}{q(x)} dx

\end{equation*}

A distribution in the exponential family can be written as

\begin{equation*}

p_\theta(x) = \frac{1}{Z(\theta)} \exp(\theta^T \phi (x))

\end{equation*}

where $\phi(x)$ is the vector of the natural statistics of $x$ and $Z(\theta) = \int \exp(\theta^T \phi(x)) dx$.
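For concreteness, a familiar member of this family is the Gamma distribution with shape $\alpha$ and rate $\beta$ (an illustrative example of mine, not from the paper), which fits the template with a two-dimensional natural statistic:

\begin{align*}
\mathrm{Gamma}(x; \alpha, \beta) &= \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x} = \frac{1}{Z(\theta)} \exp\left(\theta^\top \phi(x)\right), \\
\phi(x) &= (\log x,\ x), \qquad \theta = (\alpha - 1,\ -\beta), \qquad Z(\theta) = \Gamma(\theta_1 + 1)\,(-\theta_2)^{-(\theta_1 + 1)}.
\end{align*}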

My goal is to prove this for the special case of the normal-gamma distribution

$$(X,T) \sim \mathrm{NormalGamma}(\mu, \lambda, \alpha, \beta)$$

with pdf:

$$

f(x,t;\mu,\lambda,\alpha,\beta) = \frac{\beta^\alpha \sqrt{\lambda}}{\Gamma(\alpha)\sqrt{2\pi}} t^{\alpha-\frac{1}{2}} e^{-\beta t} \exp\left(-\frac{\lambda t(x-\mu)^2}{2} \right).

$$

we have the following theorem:

The parameter $\theta^*$ that minimises the Kullback–Leibler divergence $\mathrm{KL}(p \| p_\theta)$ over the normal-gamma family $p_\theta$ satisfies

\begin{equation*}

E_{p_{\theta^*}(x)}[\phi(x)] = E_{p(x)}[\phi(x)]

\end{equation*}

This amounts to finding $\theta^*$ such that $\nabla_\theta f(\theta^*) = 0$, where $f(\theta) = \mathrm{KL}(p \| p_\theta)$.

Then I will need to compute the Hessian to prove that $\theta^*$ is indeed a minimum.

My problem is that I have not been able to carry out these derivations successfully.

## Best Answer

To add some clarity to the paper you cited in your previous post and to keep notation consistent, I'll write the true density as $p(x)$ and the approximating density as $q_\theta(x)$, parameterized by $\theta$. Since you haven't specified the true density (the one being approximated), I'll try to (re-)explain what the proof means at each step.

The goal is to show that, as long as the approximating density $q_\theta(x)$ belongs to an exponential family, minimizing the Kullback-Leibler (KL) divergence $\mathrm{KL}(p\| q_\theta)$ only requires matching the expected sufficient statistics. First, look at the definition of the KL divergence:
\begin{align}
\mathrm{KL}(p\| q_\theta) &= \int\log \frac{p(x)}{q_\theta(x)}\, p(x)\, dx \\
&= \mathrm{E}_{p(x)}\left(\log \frac{p(x)}{q_\theta(x)} \right) \\
&= \mathrm{E}_{p(x)}(\log p(x)) - \mathrm{E}_{p(x)}(\log q_\theta(x)).
\end{align}
Since we need to minimize this as a function of the parameters $\theta$, we make this explicit by writing $f(\theta) = \mathrm{KL}(p\| q_\theta)$ and use the first-order condition $\nabla_\theta f(\theta) = 0$. The term $\mathrm{E}_{p(x)}(\log p(x))$ disappears upon differentiation because it is not a function of $\theta$. Therefore,
$$ \nabla_\theta f(\theta) = -\nabla_\theta \mathrm{E}_{p(x)}(\log q_\theta(x)). $$
Now, go back to the definition of the exponential family, $q_\theta(x) = h(x) \exp\{\theta^\top \phi(x) - A(\theta) \}$, where $A(\theta)$ is the log-normalizing constant. Taking the logarithm of this density yields
\begin{align}
\log q_\theta(x) &= \log h(x) + \theta^\top \phi(x) - A(\theta), \\
\mathrm{E}_{p(x)}(\log q_\theta(x)) &= \mathrm{E}_{p(x)}(\log h(x)) + \theta^\top \mathrm{E}_{p(x)}(\phi(x)) - A(\theta), \\
\nabla_\theta \mathrm{E}_{p(x)}(\log q_\theta(x)) &= \mathrm{E}_{p(x)}(\phi(x)) - \nabla_\theta A(\theta).
\end{align}
But as I've derived in this answer, $\nabla_\theta A(\theta) = \mathrm{E}_{q_\theta(x)}(\phi(x))$. Therefore, the first-order condition $\nabla_\theta f(\theta) = 0$ gives us $\mathrm{E}_{p(x)}(\phi(x)) = \mathrm{E}_{q_\theta(x)}(\phi(x))$.
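As a sanity check on this result (a sketch of my own, not from the paper), take the "true" density $p$ to be a two-component Gaussian mixture and let $q_\theta$ range over Gaussians, whose sufficient statistics are $\phi(x) = (x, x^2)$. A brute-force grid minimization of $\mathrm{KL}(p\|q_\theta)$ lands on the Gaussian with $p$'s mean and variance, exactly as moment matching predicts:

```python
import numpy as np

# Grid over which p and q are evaluated; wide enough that both tails vanish.
x = np.linspace(-12.0, 14.0, 4001)
dx = x[1] - x[0]

def gauss(x, m, s):
    """Density of N(m, s^2)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# "True" density p: a two-component Gaussian mixture (not itself Gaussian).
p = 0.5 * gauss(x, -1.0, 1.0) + 0.5 * gauss(x, 2.0, 1.0)

# Moments of p by numerical integration: mean 0.5, variance 3.25.
mean_p = np.sum(x * p) * dx
sd_p = np.sqrt(np.sum((x - mean_p) ** 2 * p) * dx)

# Brute-force minimization of KL(p || q_{m,s}) over a grid of (m, s).
best_kl, best_m, best_s = np.inf, None, None
for m in np.linspace(0.0, 1.0, 51):
    for s in np.linspace(1.5, 2.1, 31):
        q = gauss(x, m, s)
        kl = np.sum(p * np.log(p / q)) * dx  # KL(p || q)
        if kl < best_kl:
            best_kl, best_m, best_s = kl, m, s

# The minimizer matches the moments of p: best_m ~ mean_p, best_s ~ sd_p.
```

The mixture's mean is $0.5$ and its standard deviation is $\sqrt{3.25} \approx 1.80$, and the grid minimizer recovers both up to grid resolution.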

To verify that solving the first-order condition for $\theta$ gives us the minimizer, we must compute the Hessian matrix and check that it is positive-definite. From $\nabla_\theta \mathrm{E}_{p(x)}(\log q_\theta(x)) = \mathrm{E}_{p(x)}(\phi(x)) - \nabla_\theta A(\theta)$, it is easily observed that $\nabla_\theta^2 f(\theta) = \nabla_\theta^2 A(\theta)$, which is the covariance matrix of the sufficient statistics: $[\nabla_\theta^2 A(\theta)]_{ij} = \mathrm{Cov}(\phi_i(x),\phi_j(x))$. I will not prove this because it is a standard result in mathematical statistics; see lecture notes on exponential families if you're interested. Covariance matrices are always positive semi-definite, and for a minimal exponential family (one whose sufficient statistics satisfy no affine dependence, as is the case here) the covariance matrix is positive-definite. Therefore the first-order condition, solved for $\theta$, indeed produces the minimizer.
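A quick numerical illustration of this fact (my own sketch, using the scalar Poisson family rather than the normal-gamma for simplicity): with natural parameter $\theta = \log\lambda$ and sufficient statistic $\phi(x) = x$, the Poisson log-normalizer is $A(\theta) = e^\theta$, so $A''(\theta) = e^\theta = \lambda$ should equal the variance of the sufficient statistic, which is the familiar Poisson variance:

```python
import numpy as np

theta = 1.3
lam = np.exp(theta)  # mean parameter of the Poisson, lambda = A'(theta)

# Second derivative of A(theta) = exp(theta) by central finite differences.
h = 1e-4
A = np.exp
d2A = (A(theta + h) - 2.0 * A(theta) + A(theta - h)) / h**2

# Monte Carlo variance of the sufficient statistic phi(x) = x.
rng = np.random.default_rng(0)
sample_var = rng.poisson(lam, 1_000_000).var()

# Both quantities agree with lambda: d2A ~ lam ~ sample_var.
```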

Now, since you've asked how this plays out for the normal-gamma distribution, we've already established that the sufficient statistics are $\phi_1 = \log T$, $\phi_2 = T$, $\phi_3 = TX$, and $\phi_4 = TX^2$. To obtain the covariance matrix in full, you need the 10 distinct second-order partial derivatives $\frac{\partial^2 A(\theta)}{\partial\theta_i \partial\theta_j}$ for $i \le j$, $i,j=1,\ldots,4$ (the Hessian is symmetric), where $$ A(\theta)= -\left(\theta_1+\dfrac{1}{2} \right)\log\left(\frac{\theta_3^2}{4\theta_4} - \theta_2 \right) -\frac{1}{2}\log(-2\theta_4) +\log\Gamma\left(\theta_1+\frac{1}{2}\right) + \frac{1}{2}\log(2\pi), $$ for which I seriously recommend using a computer algebra system such as Wolfram Alpha.
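As an alternative to Wolfram Alpha, SymPy can produce the full $4\times 4$ Hessian of this $A(\theta)$ symbolically. As a sketch (my own, not from the paper), the snippet below also spot-checks two entries against known normal-gamma facts: $\partial^2 A/\partial\theta_1^2 = \psi_1(\alpha) = \mathrm{Var}(\log T)$ with $\alpha = \theta_1 + \tfrac{1}{2}$, and, at the point $\theta$ corresponding to $(\mu,\lambda,\alpha,\beta) = (0, 1, 2, 3)$, $\partial^2 A/\partial\theta_2^2 = \alpha/\beta^2 = \mathrm{Var}(T) = 2/9$:

```python
import sympy as sp

t1, t2, t3, t4 = sp.symbols('theta1 theta2 theta3 theta4')

# Log-normalizer of the normal-gamma family in its natural parameterization.
A = (-(t1 + sp.Rational(1, 2)) * sp.log(t3**2 / (4 * t4) - t2)
     - sp.Rational(1, 2) * sp.log(-2 * t4)
     + sp.loggamma(t1 + sp.Rational(1, 2))
     + sp.Rational(1, 2) * sp.log(2 * sp.pi))

# Full 4x4 Hessian (symmetric, hence 10 distinct entries).
H = sp.hessian(A, (t1, t2, t3, t4))

# Spot check 1: Var(log T) = trigamma(alpha), alpha = theta1 + 1/2.
check1 = sp.simplify(H[0, 0] - sp.polygamma(1, t1 + sp.Rational(1, 2)))

# Spot check 2: at (mu, lambda, alpha, beta) = (0, 1, 2, 3), i.e.
# theta = (3/2, -3, 0, -1/2), Var(T) = alpha / beta**2 = 2/9.
subs = {t1: sp.Rational(3, 2), t2: -3, t3: 0, t4: -sp.Rational(1, 2)}
check2 = sp.simplify(H[1, 1].subs(subs))

# check1 simplifies to 0; check2 simplifies to 2/9.
```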