Maximum Likelihood – Why Exactly Is the Observed Fisher Information Used?

fisher-information, maximum-likelihood

In the standard maximum likelihood setting (an iid sample $Y_{1}, \ldots, Y_{n}$ from some distribution with density $f_{y}(y \mid \theta_{0})$) and in the case of a correctly specified model, the Fisher information is given by

$$I(\theta) = -\mathbb{E}_{\theta_{0}}\left[\frac{\partial^{2}}{\partial\theta^{2}}\ln f_{y}(Y \mid \theta) \right]$$

where the expectation is taken with respect to the true density that generated the data. I have read that the observed Fisher information

$$\hat{J}(\theta) = -\frac{\partial^{2}}{\partial\theta^{2}}\ln f_{y}(Y \mid \theta)$$

is used primarily because the integral involved in calculating the (expected) Fisher information might not be feasible in some cases. What confuses me is that even if the integral is doable, the expectation has to be taken with respect to the true model, that is, involving the unknown parameter value $\theta_{0}$. If that is the case, it appears that without knowing $\theta_{0}$ it is impossible to compute $I$. Is this true?
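
For concreteness, take a Bernoulli($\theta_0$) toy example: the per-observation log-density is $\ln f_{y}(y \mid \theta) = y\ln\theta + (1-y)\ln(1-\theta)$, so

$$-\frac{\partial^{2}}{\partial\theta^{2}}\ln f_{y}(Y \mid \theta) = \frac{Y}{\theta^{2}} + \frac{1-Y}{(1-\theta)^{2}},$$

and taking the expectation at $\theta = \theta_0$ (using $\mathbb{E}_{\theta_0}[Y] = \theta_0$) gives $I(\theta_0) = 1/\bigl(\theta_0(1-\theta_0)\bigr)$, which I cannot evaluate without knowing $\theta_0$, whereas the observed version can be computed from the data at any candidate value of $\theta$.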

Best Answer

You've got four quantities here: the true parameter $\theta_0$, a consistent estimate $\hat \theta$, the expected information $I(\theta)$ at $\theta$, and the observed information $J(\theta)$ at $\theta$. These quantities are only equivalent asymptotically, but that is typically how they are used.

  1. The observed information $$ J (\theta_0) = -\frac{1}{N} \sum_{i=1}^N \frac{\partial^2}{\partial \theta_0^2} \ln f( y_i|\theta_0) $$ converges in probability to the expected information $$ I(\theta_0) = -E_{\theta_0} \left[ \frac{\partial^2}{\partial \theta_0^2} \ln f( y| \theta_0) \right] $$ when $Y$ is an iid sample from $f(\theta_0)$. Here $ E_{\theta_0} (x)$ indicates the expectation w/r/t the distribution indexed by $\theta_0$: $\int x f(x | \theta_0) dx$. This convergence holds because of the law of large numbers, so the assumption that $Y \sim f(\theta_0)$ is crucial here.

  2. When you've got an estimate $\hat \theta$ that converges in probability to the true parameter $\theta_0$ (i.e., is consistent), then you can substitute it anywhere you see a $\theta_0$ above, essentially due to the continuous mapping theorem$^*$, and all of the convergences continue to hold (see the simulation sketch below).

$^*$ Actually, it appears to be a bit subtle.
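
To illustrate both points, here is a minimal simulation sketch (my own example, with made-up function names, assuming NumPy and SciPy are available) for the Cauchy location model. In this location family the expected information per observation is the known constant $1/2$, while the observed information varies from sample to sample.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta0 = 3.0  # true location parameter

def neg_loglik(theta, y):
    # Cauchy location model: f(y|theta) = 1 / (pi * (1 + (y - theta)^2));
    # additive constants are dropped.
    return np.sum(np.log1p((y - theta) ** 2))

def observed_info(theta, y):
    # J(theta) = -d^2/dtheta^2 of the log-likelihood, summed over observations.
    u = y - theta
    return np.sum(2.0 * (1.0 - u**2) / (1.0 + u**2) ** 2)

for n in (50, 500, 5000):
    y = theta0 + rng.standard_cauchy(n)
    # MLE by one-dimensional optimization in a window around the sample median.
    res = minimize_scalar(neg_loglik, args=(y,), method="bounded",
                          bounds=(np.median(y) - 5.0, np.median(y) + 5.0))
    theta_hat = res.x
    # Expected information per observation is 1/2 for this location family
    # (here it happens not to depend on theta0).
    print(f"n={n:5d}  theta_hat={theta_hat:6.3f}  "
          f"J(theta_hat)/n={observed_info(theta_hat, y)/n:.3f}  I/n=0.500")
```

For each sample, $J(\hat\theta)/n$ fluctuates around $1/2$, but as $n$ grows it settles down near the expected information per observation, which is exactly the convergence described in points 1 and 2.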

Remark

As you surmised, observed information is typically easier to work with because differentiation is easier than integration, and you might have already evaluated it in the course of some numerical optimization. In some circumstances (for example, the Normal location model with known variance) they will be exactly the same.
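
To spell out that Normal case: with $Y_i \sim N(\mu, \sigma^2)$ and $\sigma^2$ known, the log-likelihood is

$$\ell(\mu) = -\frac{n}{2}\ln(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_i - \mu)^{2}, \qquad -\frac{\partial^{2}\ell}{\partial\mu^{2}} = \frac{n}{\sigma^{2}},$$

so the second derivative does not involve the data at all, and the observed information coincides with the expected information $n/\sigma^{2}$ exactly, not just asymptotically.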

The article "Assessing the Accuracy of the Maximum Likelihood Estimator: Observed Versus Expected Fisher Information" by Efron and Hinkley (1978) makes an argument in favor of the observed information for finite samples.