Fisher Information – Why is the Fisher Information the Inverse of the Asymptotic Covariance?

chi-squared-distribution, fisher-information, mathematical-statistics, maximum-likelihood, self-study

For the multinomial distribution, I had spent a lot of time and effort calculating the inverse of the Fisher information (for a single trial) using things like the Sherman-Morrison formula. But apparently it is exactly the same thing as the covariance matrix of a suitably normalized multinomial.

In other words, all of the effort of calculating the log-likelihood, the score and its partial derivatives, taking their expectations, and then inverting the resulting matrix was completely wasted.
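For what it's worth, here is a quick numerical sanity check of the claim in Python, with made-up numbers (the probabilities are arbitrary, not anything from the books cited below). It takes a single multinomial trial with $k+1$ cells, treats the first $k$ cell probabilities as the free parameters, computes the Fisher information $I_{ij} = \delta_{ij}/p_i + 1/p_{k+1}$ directly, and confirms that its inverse is $\operatorname{diag}(p) - pp^T$, i.e. the covariance matrix of the vector of the first $k$ cell indicators:

```python
import numpy as np

# A quick check with made-up numbers (k + 1 = 4 cells, first k = 3 free).
p = np.array([0.2, 0.3, 0.1])        # free parameters p_1, ..., p_k
p_last = 1.0 - p.sum()               # remaining cell probability p_{k+1}

# Fisher information of a single trial in this parameterization:
# I_ij = delta_ij / p_i + 1 / p_{k+1}
I = np.diag(1.0 / p) + 1.0 / p_last

# Covariance matrix of the first k cell-indicator variables:
# Sigma = diag(p) - p p^T
Sigma = np.diag(p) - np.outer(p, p)

print(np.allclose(np.linalg.inv(I), Sigma))    # True
print(np.allclose(I @ Sigma, np.eye(len(p))))  # True
```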

This relationship also appears to be alluded to in the answers to this question.

Question: Why does this convenient relationship exist? How is it stated formally?

My guess is that it has something to do with the "asymptotic distribution of the MLE". Specifically, it says on p. 175 of Keener, Theoretical Statistics: Topics for a Core Course, that $$\sqrt{n}(\hat{\theta} - \theta) \overset{d}{\implies} \mathscr{N}(0, I(\theta)^{-1})\,. $$
So if this normalized version of the multinomial satisfies the Cramér-Rao lower bound / information inequality (maybe?), then its covariance will equal its asymptotic covariance, because the MLE is supposed to be asymptotically unbiased?

The basis for this question is my attempt to complete Exercise 12.56 in Lehmann and Romano, Testing Statistical Hypotheses, to verify that Pearson's $\chi^2$ test of goodness-of-fit is a special case of the Rao score test, as well as my attempt to understand the proof of Theorem 14.3.1 (i) of the same book. In the proof, when showing that the statistic converges in distribution to $\chi^2_k$, they pull $$V_n := n^{1/2}\left(\frac{N_1}{n} - p_0(1), \dots, \frac{N_k}{n} - p_0(k)\right) \,, $$ seemingly out of a hat, and yet it solves the problem.

But my friend told me that $(\frac{N_1}{n}, \dots, \frac{N_k}{n})$ is the MLE for the parameters of the multinomial. If this is true, then the vector which Lehmann and Romano pulled out of a hat was actually $\sqrt{n}(\hat{\theta}_n - \theta)$, for which, by the above result about the asymptotic distribution of the MLE, $$V_n^T I(\theta) V_n \overset{d}{\implies} \chi^2_k \,. $$
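To convince myself this was plausible, I ran a small simulation, sketched below in Python: with made-up null probabilities, $k+1$ cells, and only the first $k$ coordinates of the cell frequencies used to form $V_n$ (so that $I(\theta)$ is an invertible $k \times k$ matrix), the quadratic form $V_n^T I(\theta) V_n$ does indeed look approximately $\chi^2_k$-distributed:

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)

# Made-up setup: k + 1 = 4 cells with null probabilities p_0; the first
# k cell probabilities are the free parameters, so I(theta) is k x k.
p0 = np.array([0.2, 0.3, 0.1, 0.4])
k = len(p0) - 1
n, reps = 2000, 20000

I = np.diag(1.0 / p0[:k]) + 1.0 / p0[k]     # Fisher information, one trial

N = rng.multinomial(n, p0, size=reps)       # shape (reps, k + 1)
V = np.sqrt(n) * (N[:, :k] / n - p0[:k])    # rows are realizations of V_n
stats = np.einsum('ri,ij,rj->r', V, I, V)   # quadratic forms V_n^T I(theta) V_n

# Empirical quantiles should be close to the chi^2_k quantiles.
for q in (0.5, 0.9, 0.99):
    print(q, np.quantile(stats, q).round(2), st.chi2.ppf(q, df=k).round(2))
```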

But in Lehmann-Romano, they derive this $I(\theta)$ as the inverse of the covariance matrix of $V_n$. How did they know to do this? I.e., how did they know that the Cramér-Rao lower bound held in this case?

Best Answer

Never mind, I just realized that this question was stupid.

Specifically, by the Multivariate Central Limit Theorem (which doesn't depend on the MLE result in any way, so there is no circular reasoning here), we have $$\sqrt{n}(\hat{\theta}_n - \theta) = V_n \overset{d}{\implies} \mathscr{N}(0, \Sigma) $$ where $\Sigma$ is the covariance matrix of $V_n$. Then, by the MLE result, we also have $$ V_n = \sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\implies}\mathscr{N}(0, I(\theta)^{-1}) \,.$$

Comparing the two limits (and using the fact that limits in distribution are unique), it follows that $$\Sigma = I(\theta)^{-1} \iff \Sigma^{-1} = I(\theta) \,. $$ So this doesn't actually require the Cramér-Rao lower bound to hold for $V_n$ (it seems to me).
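For completeness, here is a small Monte Carlo sketch of this in Python (again with made-up probabilities, and with the first $k$ of $k+1$ cells used as the free coordinates): the empirical covariance matrix of simulated $V_n$'s is close to $I(\theta)^{-1} = \operatorname{diag}(p) - pp^T$, exactly as the comparison of the two limits predicts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up null probabilities; k + 1 = 4 cells, first k = 3 coordinates kept.
p0 = np.array([0.2, 0.3, 0.1, 0.4])
k = len(p0) - 1
n, reps = 5000, 20000

N = rng.multinomial(n, p0, size=reps)        # shape (reps, k + 1)
V = np.sqrt(n) * (N[:, :k] / n - p0[:k])     # rows are realizations of V_n

Sigma_hat = np.cov(V, rowvar=False)          # empirical covariance of V_n
I = np.diag(1.0 / p0[:k]) + 1.0 / p0[k]      # Fisher information I(theta)
print(np.round(Sigma_hat, 3))
print(np.round(np.linalg.inv(I), 3))         # should roughly agree
```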
