Maximum Likelihood Inference – Exploring Fisher Information and Variance of Score Function

fisher information, inference, maximum likelihood

The Fisher information's connection to the negative expected Hessian at $\theta_{MLE}$ provides insight in the following way: at the MLE, high curvature implies that an estimate of $\theta$ even slightly different from the true MLE would have resulted in a very different likelihood.
$$
\mathcal{I}(\theta)_{ij}=-\mathbb{E}\left[\frac{\partial^{2}}{\partial\theta_{i}\,\partial\theta_{j}}\,\ell(\theta)\right],\qquad 1\leq i,j\leq p
$$

This is good, since it means we can be relatively sure about our estimate.
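As a concrete illustration (an example added here, not from the original question): for $X_1, \dots, X_n \overset{\text{iid}}{\sim} N(\theta, \sigma^2)$ with $\sigma^2$ known,

$$\ell(\theta) = \text{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \theta)^2, \qquad \mathcal{I}(\theta) = -\mathbb{E}\Bigg(\frac{\partial^2 \ell}{\partial \theta^2}\Bigg) = \frac{n}{\sigma^2},$$

so the curvature grows with the sample size and shrinks with the noise level: more data, or cleaner data, means a more sharply peaked log-likelihood and a more certain estimate.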

The other connection, of the Fisher information to the variance of the score when evaluated at the MLE, is less clear to me.
$$\mathcal{I}(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\,\ell(\theta)\right)^{2}\right]$$

The implication is: high Fisher information implies a high variance of the score function at the MLE.
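To see this numerically, here is a minimal simulation sketch (an illustration under assumed settings, not from the original post): $n$ i.i.d. draws from $N(\theta_0, \sigma^2)$ with $\sigma$ known, for which the score at the true mean is $s(\theta_0) = \sum_i (x_i - \theta_0)/\sigma^2$ and the Fisher information is $n/\sigma^2$.

```python
import numpy as np

# Minimal simulation sketch (illustrative assumptions, not from the original
# post): X_1..X_n iid N(theta0, sigma^2) with sigma known. The score at the
# true mean is s(theta0) = sum_i (x_i - theta0) / sigma^2, and the Fisher
# information is I(theta0) = n / sigma^2.
rng = np.random.default_rng(0)
theta0, sigma, n, reps = 1.0, 2.0, 50, 100_000

samples = rng.normal(theta0, sigma, size=(reps, n))
scores = (samples - theta0).sum(axis=1) / sigma**2

print("mean of score:     ", scores.mean())      # ~ 0
print("variance of score: ", scores.var())       # ~ n / sigma^2 = 12.5
print("Fisher information:", n / sigma**2)       # 12.5

# The MLE for the mean is the sample average; its variance across samples is
# sigma^2 / n, i.e. exactly 1 / I(theta0).
mles = samples.mean(axis=1)
print("variance of MLE:   ", mles.var())         # ~ sigma^2 / n = 0.08
print("1 / information:   ", sigma**2 / n)       # 0.08
```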

Intuitively, this means that the score function is highly sensitive to the sampling of the data, i.e., had we drawn a different sample, we would likely have gotten a non-zero gradient of the log-likelihood. This seems to have a negative implication to me. Don't we want the score function to be zero in a way that is highly robust to different samplings of the data?

Lower Fisher information, on the other hand, would indicate that the score function has low variance at the MLE, with mean zero. This implies that regardless of which sample we happen to draw, the gradient of the log-likelihood will be close to zero (which is good!).

What am I missing?

Best Answer

> Intuitively, this means that the score function is highly sensitive to the sampling of the data, i.e., had we drawn a different sample, we would likely have gotten a non-zero gradient of the log-likelihood. This seems to have a negative implication to me. Don't we want the score function to be zero in a way that is highly robust to different samplings of the data?

That is not the correct intuition for the score function. Remember that the score function is a derivative with respect to the parameter, not the data. The Fisher information is defined as the variance of the score, but under simple regularity conditions it is also the negative of the expected value of the second derivative of the log-likelihood. So, if we write the log-likelihood as $\ell(\theta | \mathbf{X})$ and the score function as $s(\theta | \mathbf{X})$ (i.e., with explicit conditioning on data $\mathbf{X}$), then the Fisher information is:

$$\mathcal{I}(\theta) = -\mathbb{E} \Bigg( \frac{\partial^2 \ell}{\partial \theta^2} (\theta | \mathbf{X}) \Bigg) = -\mathbb{E} \Bigg( \frac{\partial s}{\partial \theta} (\theta | \mathbf{X}) \Bigg).$$
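For completeness, the equality of the two forms follows from differentiating the identity $\int f(\mathbf{x} \,|\, \theta) \, d\mathbf{x} = 1$ twice under the integral sign (the "simple regularity conditions" mentioned above; this derivation is added here for illustration). Using $\partial f / \partial \theta = s f$,

$$0 = \frac{\partial}{\partial \theta} \int f(\mathbf{x} \,|\, \theta) \, d\mathbf{x} = \int s(\theta \,|\, \mathbf{x}) \, f(\mathbf{x} \,|\, \theta) \, d\mathbf{x} = \mathbb{E}\big(s(\theta \,|\, \mathbf{X})\big),$$

so the score has mean zero, and differentiating once more,

$$0 = \int \frac{\partial s}{\partial \theta} \, f \, d\mathbf{x} + \int s^2 \, f \, d\mathbf{x} = \mathbb{E}\Bigg(\frac{\partial s}{\partial \theta}(\theta \,|\, \mathbf{X})\Bigg) + \mathbb{E}\big(s(\theta \,|\, \mathbf{X})^2\big),$$

which gives $\mathcal{I}(\theta) = -\mathbb{E}(\partial s / \partial \theta) = \mathbb{E}(s^2) = \mathrm{Var}(s)$.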

The thing to note here is that the derivatives are taken with respect to the parameter, not the data. So, we can see that a high (magnitude) value for the Fisher information means that the score function is, on average, highly sensitive to the parameter value, not the data. If the score function is highly sensitive to the parameter value, then its root (which is the MLE) is pinned down sharply: perturbing the score function slightly, as sampling different data would, moves that root only a little, and so the MLE has lower variance.
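To make that last step concrete (a standard first-order heuristic, added here for illustration rather than taken from the original answer), expand the score around the true parameter $\theta_0$ and solve for the MLE $\hat{\theta}$:

$$0 = s(\hat{\theta} \,|\, \mathbf{X}) \approx s(\theta_0 \,|\, \mathbf{X}) + \frac{\partial s}{\partial \theta}(\theta_0 \,|\, \mathbf{X}) \, (\hat{\theta} - \theta_0) \quad \Longrightarrow \quad \hat{\theta} - \theta_0 \approx \frac{s(\theta_0 \,|\, \mathbf{X})}{\mathcal{I}(\theta_0)},$$

where the slope $-\partial s / \partial \theta$ has been replaced by its expectation $\mathcal{I}(\theta_0)$. Taking variances,

$$\mathrm{Var}(\hat{\theta}) \approx \frac{\mathrm{Var}\big(s(\theta_0 \,|\, \mathbf{X})\big)}{\mathcal{I}(\theta_0)^2} = \frac{\mathcal{I}(\theta_0)}{\mathcal{I}(\theta_0)^2} = \frac{1}{\mathcal{I}(\theta_0)}.$$

The high variance of the score in the numerator is more than compensated by the steep slope in the denominator, which is exactly why high Fisher information is good news even though the score itself is more variable across samples.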