[Math] the difference between observed information and Fisher information

fisher-information, statistical-inference, statistics

I have seen literature stating that the observed information $J(\theta)$ is equal to the Fisher information $I(\theta)$. They are given different notations but take the same parameter. If they are equal, it is not clear why they have different notations. Could anyone please explain?

Best Answer

Let $X_1,\dots,X_n \sim f(x;\theta)$ be i.i.d. The Fisher information is a theoretical quantity defined by $$ \mathcal{I}(\theta) = - \mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\ln f(X;\theta) \right], $$ where $\theta$ is the unknown parameter of interest. Hence, for a sample of size $n$ and MLE $\hat{\theta}_n$, you can estimate the Fisher information of the sample by $n\mathcal{I}(\hat{\theta}_n)$.

Observed information is defined by $$ \mathcal{I}_{obs}(\hat{\theta}_n) = - n\left[\frac{1}{n}\sum_{i=1}^n\frac{\partial^2}{\partial \theta^2}\ln f(x_i;\theta)\Big|_{\theta=\hat{\theta}_n} \right], $$
which is simply the sample analogue of the above, with the expectation replaced by an average over the observed data. So, as you can see, the two notions are defined differently. However, for many standard models, if you plug the MLE into the Fisher information you get exactly the observed information, $\mathcal{I}_{obs}(\hat{\theta}_n)=n\mathcal{I}(\hat{\theta}_n)$.
To show this in a fairly general case, you can work out the algebra for a one-parameter exponential family distribution (it is a straightforward calculation).
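The equality at the MLE can be checked numerically. The following is a minimal sketch, assuming the exponential density $f(y;\theta) = \frac{1}{\theta} e^{-y/\theta}$ (mean $\theta$), for which the MLE is the sample mean and $\mathcal{I}(\theta) = 1/\theta^2$. The function names, seed, and sample size are illustrative choices, not part of the answer.

```python
import random


def exp_mean_mle(ys):
    """MLE of theta for f(y; theta) = (1/theta) * exp(-y/theta): the sample mean."""
    return sum(ys) / len(ys)


def observed_information(ys, theta):
    """Negative second derivative of the sample log-likelihood at theta."""
    # d^2/dtheta^2 [-y/theta - ln(theta)] = -2*y/theta^3 + 1/theta^2
    return -sum(-2.0 * y / theta**3 + 1.0 / theta**2 for y in ys)


def plug_in_fisher_information(n, theta):
    """n * I(theta), using I(theta) = 1/theta^2 for this model."""
    return n / theta**2


rng = random.Random(0)  # fixed seed so the run is reproducible
true_theta = 2.0
# random.expovariate takes a rate, so the rate is 1/theta for mean theta
ys = [rng.expovariate(1.0 / true_theta) for _ in range(500)]
theta_hat = exp_mean_mle(ys)

obs_at_mle = observed_information(ys, theta_hat)
fisher_at_mle = plug_in_fisher_information(len(ys), theta_hat)
obs_at_truth = observed_information(ys, true_theta)
fisher_at_truth = plug_in_fisher_information(len(ys), true_theta)

print("at the MLE:       ", obs_at_mle, fisher_at_mle)
print("at the true theta:", obs_at_truth, fisher_at_truth)
```

At $\hat{\theta}_n$ the two numbers agree up to floating-point rounding. At any other value of $\theta$, such as the true parameter, they generally differ, because the observed information depends on the data while $n\mathcal{I}(\theta)$ does not.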

Related Solutions

Statistics – Intuitive Explanation of Fisher Information

From the way you write the information, it seems that you assume you have only one parameter to estimate ($\theta$) and you consider one random variable (the observation $X$ from the sample). This makes the argument much simpler, so I will proceed this way.

You use the information when you want to conduct inference by maximizing the log-likelihood. That log-likelihood is a function of $\theta$ that is random because it depends on $X$. You would like to find a unique maximum by locating the $\theta$ that attains it. Typically, you solve the first-order condition by setting the score $\frac{\partial\ell \left( \theta ; x \right)}{\partial \theta} = \frac{\partial\log p \left( x ; \theta \right)}{\partial \theta}$ equal to 0. Now you would like to know how accurate that estimate is. The curvature of the likelihood function around its maximum gives you that information: if the likelihood is sharply peaked around the maximum, you are fairly certain about the estimate, whereas if it is flat you are quite uncertain. Probabilistically, you would like to know the variance of the score "around there". (This is a heuristic, non-rigorous argument; one can actually show the equivalence between the geometric and the probabilistic/statistical concepts.)

Now, we know that on average, the score is zero (see the proof of that point at the end of this answer). Thus \begin{eqnarray*} E \left[ \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \right] & = & 0\\ \int \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} p \left( x ; \theta \right) d x & = & 0 \end{eqnarray*} Differentiate both sides with respect to $\theta$ (we can interchange the integral and the derivative here, though I will not state the rigorous conditions) and apply the product rule: \begin{eqnarray*} \frac{\partial}{\partial \theta} \int \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} p \left( x ; \theta \right) d x & = & 0\\ \int \frac{\partial^2 \ell \left( \theta ; x \right)}{\partial \theta^2} p \left( x ; \theta \right) d x + \int \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \frac{\partial p \left( x ; \theta \right)}{\partial \theta} d x & = & 0 \end{eqnarray*}

The second term on the left-hand side is \begin{eqnarray*} \int \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \frac{\partial p \left( x ; \theta \right)}{\partial \theta} d x & = & \int \frac{\partial \log p \left( x ; \theta \right)}{\partial \theta} \frac{\partial p \left( x ; \theta \right)}{\partial \theta} d x\\ & = & \int \frac{\partial \log p \left( x ; \theta \right)}{\partial \theta} \frac{\frac{\partial p \left( x ; \theta \right)}{\partial \theta}}{p \left( x ; \theta \right)} p \left( x ; \theta \right) d x\\ & = & \int \left( \frac{\partial \log p \left( x ; \theta \right)}{\partial \theta} \right)^2 p \left( x ; \theta \right) d x\\ & = & V \left[ \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \right] \end{eqnarray*}

(Here the second line follows from dividing and multiplying by $p(x;\theta)$. The third line follows from applying the chain rule to the derivative of the log. The final line follows from the expectation of the score being zero: the variance equals the expectation of the square, with no need to subtract the square of the expectation.)

From which you can see

\begin{eqnarray*} V \left[ \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \right] & = & - \int \frac{\partial^2 \ell \left( \theta ; x \right)}{\partial \theta^2} p \left( x ; \theta \right) dx\\ & = & - E \left[ \frac{\partial^2 \ell \left( \theta ; x \right)}{\partial \theta^2} \right] \end{eqnarray*}

Now you could see why summarizing uncertainty (curvature) about the likelihood function takes the particular formula of Fisher information.
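The two facts used in this derivation (the score has mean zero, and its variance equals minus the expected second derivative) can be illustrated by simulation. This is a rough Monte Carlo sketch, assuming an exponential density $p(x;\theta) = \frac{1}{\theta} e^{-x/\theta}$ as a concrete model; the model, seed, and sample size are assumptions made for illustration only.

```python
import random
import statistics


def score(y, theta):
    """First derivative of ln p(y; theta) = -y/theta - ln(theta)."""
    return y / theta**2 - 1.0 / theta


def second_derivative(y, theta):
    """Second derivative of ln p(y; theta) with respect to theta."""
    return -2.0 * y / theta**3 + 1.0 / theta**2


rng = random.Random(1)  # fixed seed for reproducibility
theta = 2.0
# Draw from the model at the same theta used to evaluate the score
draws = [rng.expovariate(1.0 / theta) for _ in range(200_000)]

scores = [score(y, theta) for y in draws]
mean_score = statistics.fmean(scores)        # should be close to 0
var_score = statistics.pvariance(scores)     # should be close to 1/theta^2
neg_mean_curvature = -statistics.fmean(
    second_derivative(y, theta) for y in draws
)                                            # should also be close to 1/theta^2

print(mean_score, var_score, neg_mean_curvature, 1.0 / theta**2)
```

Both estimates should land near $1/\theta^2 = 0.25$, up to Monte Carlo error. Note that the identity requires evaluating the score at the same $\theta$ that generated the data.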

We can even go further and prove that the variance of any unbiased estimator is bounded below by the inverse of the information (this is called the Cramér-Rao lower bound), and that, under regularity conditions, the maximum likelihood estimator attains this bound asymptotically.
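The Cramér-Rao bound can be seen in a case where it is attained exactly. As a sketch, assume $n$ i.i.d. exponential observations with mean $\theta$: the MLE is the sample mean, it is unbiased, and its variance $\theta^2/n$ equals the inverse of the sample information $n/\theta^2$. The parameter values and the number of replications below are arbitrary illustrative choices.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed for reproducibility
theta, n, reps = 2.0, 50, 20_000


def sample_mean(rng, theta, n):
    """MLE of theta from n exponential draws with mean theta."""
    return sum(rng.expovariate(1.0 / theta) for _ in range(n)) / n


# Repeat the experiment many times to estimate the sampling variance of the MLE
estimates = [sample_mean(rng, theta, n) for _ in range(reps)]
empirical_variance = statistics.pvariance(estimates)

# Inverse of the Fisher information of the whole sample, n / theta^2
cramer_rao_bound = theta**2 / n

print(empirical_variance, cramer_rao_bound)
```

The empirical variance should be close to the bound, with the gap shrinking as the number of replications grows.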

To answer an additional question by the OP, I will show why the expectation of the score is zero. Since $p \left( x ; \theta \right)$ is a density, \begin{eqnarray*} \int p \left( x ; \theta \right) \mathrm{d} x & = & 1 \end{eqnarray*} Take derivatives on both sides: \begin{eqnarray*} \frac{\partial}{\partial \theta} \int p \left( x ; \theta \right) \mathrm{d} x & = & 0 \end{eqnarray*} Looking at the left-hand side, \begin{eqnarray*} \frac{\partial}{\partial \theta} \int p \left( x ; \theta \right) \mathrm{d} x & = & \int \frac{\partial p \left( x ; \theta \right)}{\partial \theta} \mathrm{d} x\\ & = & \int \frac{\frac{\partial p \left( x ; \theta \right)}{\partial \theta}}{p \left( x ; \theta \right)} p \left( x ; \theta \right) \mathrm{d} x\\ & = & \int \frac{\partial \log p \left( x ; \theta \right)}{\partial \theta} p \left( x ; \theta \right) \mathrm{d} x\\ & = & E \left[ \frac{\partial \ell \left( \theta ; x \right)}{\partial \theta} \right] \end{eqnarray*} Thus the expectation of the score is zero.

This was a non-rigorous exposition. I recommend following up on these arguments in a good textbook on statistical inference. (I personally recommend the book by Casella and Berger, but there are many other excellent books.)

[Math] Fisher information for exponential distribution

Yes it's correct. Very well done.

This doesn't simplify the work a lot in this case, but here's an interesting result: in the case of $n$ i.i.d. random variables $y_1,\dots,y_n$, you can obtain the Fisher information $i_{\vec y}(\theta)$ for $\vec y$ via $n \cdot i_y (\theta)$, where $y$ is a single observation from your distribution.

Here $\ell(\theta) = \ln( \frac{1}{\theta} e^{-y/\theta}) = -y/\theta - \ln(\theta) \implies \frac{\partial}{\partial \theta} \ell (\theta) = \frac{y}{\theta^2} - \frac{1}{\theta} \implies \frac{\partial^2}{\partial \theta^2} \ell(\theta) = - \frac{2y}{\theta^3} + \frac{1}{\theta^2}$ \begin{align*} i_y(\theta) &= - E \left[ \frac{\partial^2}{\partial \theta^2} \ell(\theta) \right] = -E \left[ - \frac{2y}{\theta^3} + \frac{1}{\theta^2} \right] = \dfrac{2 \theta}{\theta^3} - \dfrac{1}{\theta^2} = \dfrac{1}{\theta^2} \end{align*} and multiplying by $n$ gives Fisher information $n/\theta^2$.
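As a cross-check, the same value can be obtained from the variance of the score instead of the second derivative. This uses the standard fact (an assumption not stated in the answer) that an exponential variable with mean $\theta$ has variance $\theta^2$: \begin{align*} i_y(\theta) &= \operatorname{Var}\left[ \frac{\partial}{\partial \theta} \ell(\theta) \right] = \operatorname{Var}\left[ \frac{y}{\theta^2} - \frac{1}{\theta} \right] = \frac{\operatorname{Var}(y)}{\theta^4} = \frac{\theta^2}{\theta^4} = \frac{1}{\theta^2}, \end{align*} which agrees with the result from the second derivative, so the sample information is again $n/\theta^2$.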
