Solved – Why calculate the standard error of an MLE (and confidence intervals) from Hessian matrices

confidence interval, fisher information, hessian, maximum likelihood, standard error

I might not have fully understood these concepts, and I am confused about how the standard error is calculated. Here are my understandings and confusions; let me know where I went wrong.

EDIT: I was talking about the Hessian matrix output from R's optim.

The standard error of a parameter $\theta$ is the standard deviation of its estimate, $\text{var}(\hat\theta)^{1/2}$. I've read that one should calculate it from the expected information matrix, i.e. as $E[I]^{-1/2}$, which is $E[-H]^{-1/2}$. I assume that to get the expected Hessian matrix I need to run my maximum likelihood program many times to get multiple Hessian matrices. But why can't we just calculate the SD directly as sd$(\hat\theta)$, given that we already have a handful of estimates $\hat\theta$? Would the results be different?

Same question for the confidence interval of a parameter. For a 95% CI, for example, the standard way seems to be to calculate it as $\hat\theta \pm 1.96\cdot E[-H]^{-1/2}$. Is that different from just running a handful of iterations to get many estimates $\hat\theta$ and finding the interval where 95% of them fall? Is one more accurate given the same number of realizations?
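For concreteness, here is a minimal sketch of the Hessian-based calculation I mean, in R. The model, data, and starting value are made up purely for illustration:

```r
## Made-up example: estimate the mean of a normal with known sigma
set.seed(1)
sigma <- 2
x <- rnorm(50, mean = 5, sd = sigma)

## Negative log-likelihood in the single parameter mu
negll <- function(mu) -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))

## optim minimises, so its hessian is that of the *negative* log-likelihood
fit <- optim(par = 0, fn = negll, method = "BFGS", hessian = TRUE)

## Standard error from the Hessian at the optimum
se <- sqrt(diag(solve(fit$hessian)))

## 95% Wald-type confidence interval
ci <- fit$par + c(-1, 1) * 1.96 * se
```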

Best Answer

I assume that to get the expected Hessian matrix I need to run my maximum likelihood program many times to get multiple Hessian matrices

No, the expectation is based on the model. We're not taking some kind of ensemble average; we're literally computing an expectation:

$\mathcal{I}(\theta) = - \text{E} \left[\left. \frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\right|\theta \right]\,.$

(though we might be finding it from a different expression that yields the same quantity).

That is, we do some algebra before we implement it in computation.
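For instance, under the usual regularity conditions the same quantity can also be written in terms of the score, which is sometimes easier to work with:

$\mathcal{I}(\theta) = \text{E} \left[\left. \left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^{2}\right|\theta \right] = - \text{E} \left[\left. \frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\right|\theta \right]\,.$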

We have a single ML estimate, and we're computing the standard error from the second derivative of the log-likelihood at its peak -- a "sharp" peak (large curvature) means a small standard error, while a broad peak means a large standard error.
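To see that this single-fit, curvature-based standard error does track the sampling variability the question asks about, here is a small simulation sketch (the normal-mean model and all names are made up for illustration): the SE from one Hessian, the standard deviation of the MLE across many simulated data sets, and the theoretical $\sigma/\sqrt{n}$ should all roughly agree.

```r
## Sketch: Hessian-based SE from one fit vs. spread of the MLE across replications
set.seed(2)
sigma <- 2; n <- 50; mu_true <- 5

fit_once <- function(x) {
  negll <- function(mu) -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
  optim(par = 0, fn = negll, method = "BFGS", hessian = TRUE)
}

## One data set: standard error from the curvature at the peak
fit <- fit_once(rnorm(n, mu_true, sigma))
se_hessian <- sqrt(diag(solve(fit$hessian)))

## Many data sets: standard deviation of the estimates themselves
estimates <- replicate(2000, fit_once(rnorm(n, mu_true, sigma))$par)

c(se_hessian = se_hessian, sd_of_estimates = sd(estimates), theory = sigma / sqrt(n))
```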

You might like to verify that when you do this for a normal likelihood (iid observations from $N(\mu,\sigma)$, with $\sigma$ known), the calculation yields a Fisher information of $n/\sigma^2$, and hence an asymptotic variance of $\sigma^2/n$ for the ML estimate of $\mu$, or a standard error of $\sigma/\sqrt{n}$. (Of course in this case that's also the small-sample variance and standard error.)
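A sketch of that algebra: for a single observation,

$\log f(x;\mu) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\,, \qquad \frac{\partial^2}{\partial\mu^2}\log f(x;\mu) = -\frac{1}{\sigma^2}\,,$

so the second derivative doesn't depend on the data at all. Summing over $n$ iid observations and negating gives $\mathcal{I}(\mu) = n/\sigma^2$, hence $\text{var}(\hat\mu) \approx \mathcal{I}(\mu)^{-1} = \sigma^2/n$ and a standard error of $\sigma/\sqrt{n}$.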
