[Math] Convergence rate of empirical Fisher information matrix

probability theory, statistics

I can see that, using the law of large numbers and perhaps mild conditions on the likelihood function, one can show that the empirical Fisher information matrix converges uniformly to the true Fisher information matrix. I would like to know how fast this uniform convergence is in terms of the number of observations.

It seems that there hasn't been much study on this problem and the literature is scarce.
I tried to apply some results from statistical learning theory, such as analyses based on Rademacher complexity, but those arguments seem to yield only large-deviation bounds for the empirical process (i.e., the average negative log-likelihood function) and do not seem to extend to its Hessian (i.e., the empirical Fisher information matrix).

I am not merely looking for the asymptotic behavior described by the CLT; rather, I would like large-deviation-type inequalities for the Fisher information matrix. In particular, I would like to know whether, and how fast, the probability that the deviation between the spectrum of the empirical information matrix and that of the true information matrix exceeds some $\epsilon>0$ goes to zero as a function of the sample size. I'm interested in concentration-inequality-type results.
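
To make the quantity I have in mind concrete, here is a minimal Monte Carlo sketch. The two-parameter normal model, the tolerance `eps`, the sample sizes, and the use of the operator norm as one natural way to compare spectra are all illustrative choices of mine, not part of the problem statement; the sketch just estimates how often the spectral deviation between the empirical and true Fisher information matrices exceeds $\epsilon$.

```python
import numpy as np

# Illustrative sketch: estimate P( || J_n(theta) - I(theta) ||_op > eps ) for a
# N(mu, sigma^2) model with both parameters unknown, where J_n is the empirical
# (observed) Fisher information matrix evaluated at the true parameter value.
# The model, eps, sample sizes and replication count are arbitrary choices.

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
eps = 0.1
ns = [100, 400, 1600, 6400]
reps = 2000

# True Fisher information matrix of N(mu, sigma^2) in the (mu, sigma) parametrization.
I_true = np.array([[1.0 / sigma**2, 0.0],
                   [0.0, 2.0 / sigma**2]])

def empirical_fisher(x, mu, sigma):
    """Average of the per-observation negative log-likelihood Hessians at (mu, sigma)."""
    d = x - mu
    j11 = np.full_like(d, 1.0 / sigma**2)
    j12 = 2.0 * d / sigma**3
    j22 = -1.0 / sigma**2 + 3.0 * d**2 / sigma**4
    return np.array([[j11.mean(), j12.mean()],
                     [j12.mean(), j22.mean()]])

for n in ns:
    exceed = 0
    for _ in range(reps):
        x = rng.normal(mu, sigma, size=n)
        J_n = empirical_fisher(x, mu, sigma)
        # Operator (spectral) norm of the deviation.
        if np.linalg.norm(J_n - I_true, ord=2) > eps:
            exceed += 1
    print(f"n = {n:5d}   estimated P(deviation > {eps}) = {exceed / reps:.3f}")
```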

EDIT: The paragraph asking for concentration-type inequalities was added after the answer posted by @Did, because I didn't have permission to add comments at that time and I wasn't aware of the editing rules. I accepted the answer by @Did, but I think it addresses my problem only partially.

Best Answer

You might have got the impression that the literature is scarce simply because this is a direct consequence of the central limit theorem.

To see this while keeping things simple, let us first examine the situation where one observes $n$ i.i.d. Bernoulli trials, each with probability of success $\theta$. Then the empirical Fisher information is $I_n(\theta\mid X_n)=\dfrac{X_n}{\theta^2}+\dfrac{n-X_n}{(1-\theta)^2}$, where $X_n$ denotes the number of successes. Hence, a first remark is that $I_n(\theta\mid X_n)$ diverges when $n\to\infty$ and that, to observe non-degenerate asymptotics, one should consider $$ J_n(\theta\mid X_n)=\dfrac1n\cdot I_n(\theta\mid X_n). $$

To wit, recall that the central limit theorem asserts that $X_n=n\theta+\sqrt{n\theta(1-\theta)}\cdot Z_n$, where, when $n\to\infty$, $Z_n$ converges in distribution to a standard normal random variable. Using this in the expression of $J_n(\theta\mid X_n)$, one gets $$ J_n(\theta\mid X_n)=I(\theta)+\frac{K(\theta)}{\sqrt{n}}\cdot Z_n, $$ where $I(\theta)$ is the Fisher information $I(\theta)=\dfrac1{\theta(1-\theta)}$ and $K(\theta)=\dfrac{1-2\theta}{(\theta(1-\theta))^{3/2}}$.

To sum up, in the Bernoulli case, $ \dfrac{I_n(\theta\mid X_n)-n\cdot I(\theta)}{\sqrt{n}} $ converges in distribution to a centered normal random variable with variance $K(\theta)^2=\dfrac{(1-2\theta)^2}{\theta^3(1-\theta)^3}$.
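
For what it's worth, here is a small numerical sanity check of this computation (the values of $\theta$, $n$ and the number of replications are arbitrary); it compares the Monte Carlo variance of $\sqrt{n}\,\big(J_n(\theta\mid X_n)-I(\theta)\big)$ with $K(\theta)^2$:

```python
import numpy as np

# Sanity check of the Bernoulli case above: the variance of
# sqrt(n) * (J_n - I(theta)) should be close to
# K(theta)^2 = (1 - 2*theta)^2 / (theta^3 * (1 - theta)^3).

rng = np.random.default_rng(1)
theta, n, reps = 0.3, 10_000, 50_000

I_theta = 1.0 / (theta * (1.0 - theta))
K2 = (1.0 - 2.0 * theta) ** 2 / (theta**3 * (1.0 - theta) ** 3)

# X_n ~ Binomial(n, theta); J_n = I_n / n with I_n = X_n/theta^2 + (n - X_n)/(1 - theta)^2.
X = rng.binomial(n, theta, size=reps)
J = (X / theta**2 + (n - X) / (1.0 - theta) ** 2) / n

scaled = np.sqrt(n) * (J - I_theta)
print("empirical variance:", scaled.var())
print("K(theta)^2        :", K2)
```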

The result above is quite general, provided one replaces $I(\theta)$ by the Fisher information of the distribution considered and $K(\theta)^2$ by the relevant variance. When the distribution of the sample has density $f(\ \mid\theta)$, one gets $I(\theta)=\mathrm E(g(X_1\mid\theta))$ and $K(\theta)^2=\mathrm{Var}(g(X_1\mid\theta))$, where $$ g(x\mid\theta)=-\frac{\partial^2}{\partial\theta^2}\log f(x\mid\theta). $$ (In the Bernoulli case, $f(x\mid\theta)=\theta^x(1-\theta)^{1-x}$ hence $g(x\mid\theta)=\dfrac{x}{\theta^2}+\dfrac{1-x}{(1-\theta)^2}$ and one recovers the formulas given above.)
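
As a quick illustration of this general recipe, one can estimate $I(\theta)$ and $K(\theta)^2$ by the sample mean and sample variance of $g(X_i\mid\theta)$. The sketch below does this for a Poisson$(\theta)$ sample (my choice of family, not one discussed above), for which $g(x\mid\theta)=x/\theta^2$, so that $I(\theta)=1/\theta$ and $K(\theta)^2=1/\theta^3$:

```python
import numpy as np

# Sketch of the general recipe: with g(x|theta) = -d^2/dtheta^2 log f(x|theta),
# estimate I(theta) = E[g(X|theta)] and K(theta)^2 = Var(g(X|theta)) by the
# sample mean and variance of g(X_i|theta). The Poisson(theta) family is only
# an illustration; for it, g(x|theta) = x / theta^2.

rng = np.random.default_rng(2)
theta, n = 2.5, 200_000

x = rng.poisson(theta, size=n)
g = x / theta**2                     # per-observation contribution g(X_i | theta)

print("mean of g:", g.mean(), " vs  I(theta)    =", 1.0 / theta)
print("var of g :", g.var(),  " vs  K(theta)^2  =", 1.0 / theta**3)
```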
