Why is the Fisher information matrix both an expected outer product and a Hessian?

fisher-information, multivariable-calculus, statistical-inference

If $X$ is a random variable distributed as $X \sim p(x ; {\theta^*})$, the Fisher information matrix is defined as the expected outer product matrix: $$
I(\theta) = E_{X \sim p(x ; {\theta^*})} \left[
\,\left(\nabla_{{\theta}} \log p(X\,; {\theta}) \right)
\left(\nabla_{{\theta}} \log p(X\,; {\theta}) \right)^\top \,\right].
$$

However, it is also defined as the expected Hessian matrix of the negative log-likelihood: $$
I(\theta) = E_{X \sim p(x ; {\theta^*})} \left[
\frac{\partial^2}{(\partial\theta)(\partial\theta^\top)} \bigl(-\log p(X;\theta)\bigr)
\right].
$$
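To make the question concrete, here is a one-parameter sanity check where both formulas can be computed directly. For $X \sim \text{Bernoulli}(\theta)$, so that $\log p(X;\theta) = X\log\theta + (1-X)\log(1-\theta)$, the two definitions give
\begin{align*}
E\left[\left(\frac{\partial}{\partial\theta}\log p(X;\theta)\right)^2\right]
&= E\left[\left(\frac{X}{\theta} - \frac{1-X}{1-\theta}\right)^2\right]
= \theta\cdot\frac{1}{\theta^2} + (1-\theta)\cdot\frac{1}{(1-\theta)^2}
= \frac{1}{\theta(1-\theta)}, \\
E\left[-\frac{\partial^2}{\partial\theta^2}\log p(X;\theta)\right]
&= E\left[\frac{X}{\theta^2} + \frac{1-X}{(1-\theta)^2}\right]
= \frac{1}{\theta} + \frac{1}{1-\theta}
= \frac{1}{\theta(1-\theta)},
\end{align*}
so the two expressions agree in this case, but I don't see why that should hold in general.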

I am puzzled by why these two definitions are equivalent. Specifically, I'm not sure why an (expected) outer product of first partial derivatives should equal a matrix of second derivatives.

Intuitions or derivations are appreciated!

Best Answer

Denote by $\nabla$ and $\nabla^2$ the gradient and Hessian operators with respect to $\theta$, and write the log-likelihood as $\ell(\theta;X) = \log p_\theta(X)$, so that the score is $\nabla\ell(\theta;X)$. By differentiating the identity $\int p_\theta(x)\,dx = 1$ under the integral sign, one can show that the expected score is zero, i.e. $\mathbb{E}[\nabla \ell(\theta;X)] = 0$. Therefore, since the Fisher information is defined as the variance of the score, we find that
\begin{align*}
I(\theta) &= \text{var}(\nabla \ell(\theta;X)) \\
&= \mathbb{E}[\nabla\ell(\theta;X)\nabla\ell(\theta;X)^\top] - \mathbb{E}[\nabla\ell(\theta;X)]\,\mathbb{E}[\nabla \ell(\theta;X)]^\top \\
&= \mathbb{E}[\nabla\ell(\theta;X)\nabla\ell(\theta;X)^\top],
\end{align*}
which is the expected outer product in your first definition.

Now, applying the quotient rule to $\nabla \log p_\theta(X) = \nabla p_\theta(X)/p_\theta(X)$, we have
\begin{align*}
\nabla^2\ell(\theta;X) &= \nabla^2 \log p_\theta(X) \\
&= \frac{1}{p_\theta(X)}\nabla^2 p_\theta(X) - \frac{1}{p_\theta^2(X)}\nabla p_\theta(X)\, \nabla p_\theta(X)^\top.
\end{align*}
Note that $\frac{1}{p_\theta(X)}\nabla p_\theta(X) = \nabla \log p_\theta(X) = \nabla \ell(\theta;X)$, so the second term is exactly the outer product of the score with itself. Therefore, under regularity conditions that permit exchanging differentiation and integration, we obtain
\begin{align*}
\mathbb{E}[\nabla^2\ell(\theta;X)] &= \mathbb{E}\left[\frac{1}{p_\theta(X)}\nabla^2 p_\theta(X)\right] - \mathbb{E}[\nabla\ell(\theta;X)\nabla\ell(\theta;X)^\top] \\
&= \int \left(\frac{1}{p_\theta(x)}\nabla^2 p_\theta(x)\right) p_\theta(x)\, dx - \mathbb{E}[\nabla\ell(\theta;X)\nabla\ell(\theta;X)^\top] \\
&= \nabla^2 \int p_\theta(x)\,dx - \mathbb{E}[\nabla\ell(\theta;X)\nabla\ell(\theta;X)^\top] \\
&= \nabla^2\, 1 - \mathbb{E}[\nabla\ell(\theta;X)\nabla\ell(\theta;X)^\top] \\
&= -\,\mathbb{E}[\nabla\ell(\theta;X)\nabla\ell(\theta;X)^\top].
\end{align*}
Negating both sides gives $\mathbb{E}[-\nabla^2\ell(\theta;X)] = \mathbb{E}[\nabla\ell(\theta;X)\nabla\ell(\theta;X)^\top] = I(\theta)$: the expected Hessian of the negative log-likelihood equals the expected outer product of the score, as desired.
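As a numerical illustration of this identity (my own sketch, not part of the derivation), consider $X \sim N(\mu, \sigma^2)$ with parameter vector $\theta = (\mu, \sigma)$. Monte Carlo averages of the score outer products and of the Hessian of the negative log-density should both converge to the exact Fisher information $\operatorname{diag}(1/\sigma^2,\, 2/\sigma^2)$; the gradients and Hessians below are written out analytically from the Gaussian log-density.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Score of N(mu, sigma^2), i.e. gradient of log p(x; mu, sigma) w.r.t. (mu, sigma).
g_mu = (x - mu) / sigma**2
g_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3
scores = np.stack([g_mu, g_sigma], axis=1)  # shape (n, 2)

# Definition 1: expected outer product of the score.
outer = scores.T @ scores / len(x)

# Definition 2: expected Hessian of the NEGATIVE log-density,
# entries obtained by differentiating the score once more and negating.
h_mumu = np.full_like(x, 1.0 / sigma**2)
h_musigma = 2.0 * (x - mu) / sigma**3
h_sigmasigma = -1.0 / sigma**2 + 3.0 * (x - mu) ** 2 / sigma**4
hess = np.array(
    [[h_mumu.mean(), h_musigma.mean()],
     [h_musigma.mean(), h_sigmasigma.mean()]]
)

# Exact Fisher information for (mu, sigma): diag(1/sigma^2, 2/sigma^2).
exact = np.array([[1.0 / sigma**2, 0.0], [0.0, 2.0 / sigma**2]])
print(outer)  # both estimates are close to `exact`
print(hess)
```

Both printed matrices match the exact Fisher information up to Monte Carlo error, which is the content of the equivalence above.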