Maximum Likelihood – Observed Information Matrix As Consistent Estimator of Expected Information Matrix

asymptotics, expected value, fisher information, maximum likelihood

I am trying to prove that the observed information matrix evaluated at the weakly consistent maximum likelihood estimator (MLE) is a weakly consistent estimator of the expected information matrix. This is a widely quoted result, but nobody gives a reference or a proof (I think I have exhausted the first 20 pages of Google results and my stats textbooks)!

Using a weakly consistent sequence of MLEs, I can use the weak law of large numbers (WLLN) and the continuous mapping theorem to get the result I want. However, I believe the continuous mapping theorem cannot be used; instead I think the uniform law of large numbers (ULLN) needs to be used. Does anybody know of a reference that has a proof of this? I have an attempt using the ULLN but omit it for now for brevity.

I apologise for the length of this question but notation has to be introduced. The notation is as follows (my proof is at the end).

Assume we have an iid sample of random variables $\{Y_1,\ldots,Y_N\}$ with densities $f(\tilde{Y}|\theta)$, where $\theta\in\Theta\subseteq\mathbb{R}^{k}$ (here $\tilde{Y}$ is just a general random variable with the same density as any one of the members of the sample). The vector $Y=(Y_1,\ldots,Y_N)^{T}$ collects all the sample vectors, where $Y_{i}\in\mathbb{R}^{n}$ for all $i=1,\ldots,N$. The true parameter value of the densities is $\theta_{0}$, and $\hat{\theta}_{N}(Y)$ is the weakly consistent maximum likelihood estimator (MLE) of $\theta_{0}$. Subject to regularity conditions the Fisher information matrix can be written as

$$I(\theta)=-E_\theta \left[H_{\theta}(\log f(\tilde{Y}|\theta))\right]$$

where ${H}_{\theta}$ is the Hessian matrix. The sample equivalent is

$$I_N(\theta)=\sum_{i=1}^N I_{y_i}(\theta),$$

where $I_{y_i}(\theta)=-E_\theta \left[H_{\theta}(\log f(Y_{i}|\theta))\right]$. The observed information matrix is:

$J(\theta) = -H_\theta(\log f(y|\theta))$,

(some people demand the matrix is evaluated at $\hat{\theta}$ but some don't). The sample observed information matrix is:

$J_N(\theta)=\sum_{i=1}^N J_{y_i}(\theta)$

where $J_{y_i}(\theta)=-H_\theta(\log f(y_{i}|\theta))$.
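As a small concrete example (taking $n=k=1$ purely for illustration): for a Poisson model with mean $\theta$ we have $\log f(y|\theta)=y\log\theta-\theta-\log y!$, so $H_\theta(\log f(y|\theta))=-y/\theta^{2}$. Hence $J_{y_i}(\theta)=y_i/\theta^{2}$ depends on the data, while $I(\theta)=E_\theta[\tilde{Y}]/\theta^{2}=1/\theta$ does not, which is exactly why the convergence of one to the other needs an argument.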

I can prove convergence in probability of the estimator $N^{-1}J_N(\theta)$ to $I(\theta)$, but not of $N^{-1}J_{N}(\hat{\theta}_N(Y))$ to $I(\theta_{0})$. Here is my proof so far:

Now $(J_{N}(\theta))_{rs}=-\sum_{i=1}^N \left(H_\theta(\log f(Y_i|\theta))\right)_{rs}$ is element $(r,s)$ of $J_N(\theta)$, for any $r,s=1,\ldots,k$. If the sample is iid, then by the weak law of large numbers (WLLN), the average of these summands converges in probability to $-E_{\theta}\left[\left(H_\theta(\log f(Y_{1}|\theta))\right)_{rs}\right]=(I_{Y_1}(\theta))_{rs}=(I(\theta))_{rs}$. Thus $N^{-1}(J_N(\theta))_{rs}\overset{P}{\rightarrow}(I(\theta))_{rs}$ for all $r,s=1,\ldots,k$, and so $N^{-1}J_N(\theta)\overset{P}{\rightarrow}I(\theta)$. Unfortunately we cannot simply conclude $N^{-1}J_{N}(\hat{\theta}_N(Y))\overset{P}{\rightarrow}I(\theta_0)$ by using the continuous mapping theorem since $N^{-1}J_{N}(\cdot)$ is not the same function as $I(\cdot)$.
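To make the target statement concrete, here is a minimal numerical sketch, sticking with the illustrative Poisson example above. The model, the value of $\theta_0$, and the helper function `avg_observed_info` are purely illustrative choices, and the finite-difference Hessian is only there to mimic the general recipe of plugging the MLE into the observed Hessian.

```python
import numpy as np

rng = np.random.default_rng(0)

theta0 = 2.5             # true Poisson mean (illustrative choice)
I_theta0 = 1.0 / theta0  # expected per-observation Fisher information for Poisson

def avg_observed_info(y, theta, h=1e-4):
    """N^{-1} J_N(theta): minus the second derivative of the average
    log-likelihood, approximated by a central finite difference."""
    def avg_loglik(t):
        # log f(y|t) = y*log(t) - t - log(y!); the log(y!) term is
        # constant in t, so it can be dropped for differentiation in t
        return np.mean(y * np.log(t) - t)
    return -(avg_loglik(theta + h) - 2.0 * avg_loglik(theta) + avg_loglik(theta - h)) / h**2

for N in (100, 1_000, 10_000, 100_000):
    y = rng.poisson(theta0, size=N)
    theta_mle = y.mean()  # Poisson MLE of the mean
    print(N, avg_observed_info(y, theta_mle), I_theta0)
```

For the Poisson the plug-in observed information collapses to $1/\bar{y}$, which converges to $1/\theta_0=I(\theta_0)$ directly by the WLLN and the CMT applied to $x\mapsto 1/x$; the point of the question is that this shortcut is model-specific and a general argument needs more than the CMT.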

Any help on this would be greatly appreciated.

Best Answer

$\newcommand{\convp}{\stackrel{P}{\longrightarrow}}$

I guess directly establishing some sort of uniform law of large numbers is one possible approach.

Here is another.

We want to show that $\frac{J^N(\theta_{MLE})}{N} \convp I(\theta^*)$, where I write $\theta^*$ for your true value $\theta_0$ and $J^N$ for your $J_N$.

(As you said, we have by the WLLN that $\frac{J^N(\theta)}{N} \convp I(\theta)$. But this doesn't directly help us.)

One possible strategy is to show that $$|I(\theta^*) - \frac{J^N(\theta^*)}{N}| \convp 0$$

and

$$ |\frac{J^N(\theta_{MLE})}{N} - \frac{J^N(\theta^*)}{N}| \convp 0 $$

If both of these results hold, then we can combine them, via the triangle inequality spelled out below, to get $$ |I(\theta^*) - \frac{J^N(\theta_{MLE})}{N}| \convp 0, $$

which is exactly what we want to show.
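Spelled out, the combination is just the triangle inequality, plus the fact that a sum of two terms each converging to zero in probability also converges to zero in probability:

$$ \left|I(\theta^*) - \frac{J^N(\theta_{MLE})}{N}\right| \le \left|I(\theta^*) - \frac{J^N(\theta^*)}{N}\right| + \left|\frac{J^N(\theta^*)}{N} - \frac{J^N(\theta_{MLE})}{N}\right|. $$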

The first equation follows from the weak law of large numbers.

The second almost follows from the continuous mapping theorem, but unfortunately our function $g()$ that we want to apply the CMT to changes with $N$: our $g$ is really $g_N(\theta) := \frac{J^N(\theta)}{N}$. So we cannot use the CMT.

(Comment: If you examine the proof of the CMT on Wikipedia, notice that the set $B_\delta$ they define in their proof for us now also depends on $n$. We essentially need some sort of equicontinuity at $\theta^*$ over our functions $g_N(\theta)$.)

Fortunately, if you assume that the family $\mathcal{G} = \{g_N \mid N=1,2,\ldots\}$ is stochastically equicontinuous at $\theta^*$, then it immediately follows from $\theta_{MLE} \convp \theta^*$ that \begin{align*} |g_N(\theta_{MLE}) - g_N(\theta^*)| \convp 0. \end{align*}

(See here: http://www.cs.berkeley.edu/~jordan/courses/210B-spring07/lectures/stat210b_lecture_12.pdf for a definition of stochastic equicontinuity at $\theta^*$, and a proof of the above fact.)
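For self-containedness, one standard formulation (the linked notes may phrase it slightly differently) is: $\{g_N\}$ is stochastically equicontinuous at $\theta^*$ if for every $\epsilon>0$ and $\eta>0$ there is a $\delta>0$ such that

$$ \limsup_{N\to\infty} P\left(\sup_{\|\theta-\theta^*\|<\delta} \left\|g_N(\theta)-g_N(\theta^*)\right\| > \epsilon\right) < \eta. $$

The implication is then a one-liner: for any $\epsilon>0$,

$$ P\big(\|g_N(\theta_{MLE})-g_N(\theta^*)\|>\epsilon\big) \le P\left(\sup_{\|\theta-\theta^*\|<\delta}\|g_N(\theta)-g_N(\theta^*)\|>\epsilon\right) + P\big(\|\theta_{MLE}-\theta^*\|\ge\delta\big); $$

the second term goes to zero by consistency of the MLE, the first has limsup at most $\eta$ by SE, and $\eta$ was arbitrary.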

Therefore, assuming that $\mathcal{G}$ is SE at $\theta^*$, your desired result holds true and the empirical Fisher information converges to the population Fisher information.

Now, the key question of course is, what sort of conditions do you need to impose on $\mathcal{G}$ to get SE? It looks like one way to do this is to establish a Lipschitz condition on the entire class of functions $\mathcal{G}$ (see here: http://econ.duke.edu/uploads/media_items/uniform-convergence-and-stochastic-equicontinuity.original.pdf ).
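A rough sketch of why that works, under my reading of the linked notes: suppose $\|g_N(\theta_1)-g_N(\theta_2)\|\le B_N\,\|\theta_1-\theta_2\|$ for all $\theta_1,\theta_2$ in a neighbourhood of $\theta^*$, with $B_N=O_p(1)$. Then

$$ \sup_{\|\theta-\theta^*\|<\delta}\|g_N(\theta)-g_N(\theta^*)\| \le B_N\,\delta, $$

and because $B_N$ is bounded in probability we can choose $\delta$ small enough that $P(B_N\delta>\epsilon)<\eta$ uniformly in $N$, which is exactly SE at $\theta^*$. In the present problem $g_N(\theta)=J^N(\theta)/N$, so a natural choice is $B_N=N^{-1}\sum_{i=1}^N B(Y_i)$ with $B(\cdot)$ a dominating function for the third derivatives of $\log f(\cdot|\theta)$ near $\theta^*$; if $E[B(Y_1)]<\infty$ then $B_N=O_p(1)$ by the WLLN, and this is essentially the classical regularity condition that appears in textbook proofs of asymptotic normality of the MLE.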
