Yudi Pawitan writes in his book In All Likelihood that the second derivative of the log-likelihood evaluated at the maximum likelihood estimates (MLE) is the observed Fisher information (see also this document, page 1). This is exactly what most optimization algorithms like optim in R return: the Hessian evaluated at the MLE. When the negative log-likelihood is minimized, the returned Hessian is the negative of the Hessian of the log-likelihood, i.e. the observed Fisher information itself. As you correctly point out, the estimated standard errors of the MLE are the square roots of the diagonal elements of the inverse of the observed Fisher information matrix. In other words: the square roots of the diagonal elements of the inverse of the Hessian (or the negative Hessian) are the estimated standard errors.
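To make the optim mechanics concrete, here is a minimal sketch (simulated data and object names of my own choosing, not from the question): the rate of an exponential distribution is estimated by minimizing the negative log-likelihood, and the standard error is read off the returned Hessian.
set.seed(42)
tt <- rexp(200, rate = 2)   # simulated waiting times (illustration only), true rate 2

negll <- function(rate) -sum(dexp(tt, rate = rate, log = TRUE))

fit <- optim(par = 1, fn = negll, method = "Brent", lower = 1e-6, upper = 100,
             hessian = TRUE)

fit$par                    # MLE of the rate; should be close to 1/mean(tt)
sqrt(1 / fit$hessian)      # SE from the observed information (Hessian of the neg. log-lik.)
fit$par / sqrt(length(tt)) # analytic SE for this model, rate_hat/sqrt(n), for comparison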
Summary
- The negative Hessian evaluated at the MLE is the same as the observed Fisher information matrix evaluated at the MLE.
- Regarding your main question: no, it's not correct that the observed Fisher information can be found by inverting the (negative) Hessian.
- Regarding your second question: The inverse of the (negative) Hessian is an estimator of the asymptotic covariance matrix. Hence, the square roots of the diagonal elements of the covariance matrix are estimators of the standard errors.
- I think the second document you link to got it wrong.
Formally
Let $l(\theta)$ be a log-likelihood function. The Fisher information matrix $\mathbf{I}(\theta)$ is a symmetric $(p\times p)$ matrix with entries:
$$
\left(\mathbf{I}(\theta)\right)_{ij}=-\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}l(\theta),~~~~ 1\leq i, j\leq p
$$
The observed Fisher information matrix is simply $\mathbf{I}(\hat{\theta}_{\mathrm{ML}})$, the information matrix evaluated at the maximum likelihood estimates (MLE). The Hessian is defined as:
$$
\left(\mathbf{H}(\theta)\right)_{ij}=\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}l(\theta),~~~~ 1\leq i, j\leq p
$$
It is nothing other than the matrix of second derivatives of the log-likelihood function with respect to the parameters. It follows that if you minimize the negative log-likelihood, the returned Hessian is the observed Fisher information matrix, whereas if you maximize the log-likelihood, the negative Hessian is the observed information matrix.
Further, the inverse of the observed Fisher information matrix is an estimator of the asymptotic covariance matrix:
$$
\mathrm{Var}(\hat{\theta}_{\mathrm{ML}})=[\mathbf{I}(\hat{\theta}_{\mathrm{ML}})]^{-1}
$$
The standard errors are then the square roots of the diagonal elements of the covariance matrix.
For the asymptotic distribution of a maximum likelihood estimate, we can write
$$
\hat{\theta}_{\mathrm{ML}}\stackrel{a}{\sim}\mathcal{N}\left(\theta_{0}, [\mathbf{I}(\hat{\theta}_{\mathrm{ML}})]^{-1}\right)
$$
where $\theta_{0}$ denotes the true parameter value. Hence, in the single-parameter case, the estimated standard error of the maximum likelihood estimate is given by:
$$
\mathrm{SE}(\hat{\theta}_{\mathrm{ML}})=\frac{1}{\sqrt{\mathbf{I}(\hat{\theta}_{\mathrm{ML}})}}
$$
The issue is that the standard errors come from
$$\hat\sigma^2 (X^\top X)^{-1}$$
where $\hat\sigma^2$ is the unbiased estimator of the residual variance, not the MLE. See summary.lm:
summary.lm
#R function (object, correlation = FALSE, symbolic.cor = FALSE,
#R     ...)
#R {
#R     z <- object
#R     p <- z$rank
#R     rdf <- z$df.residual
#R     ...
#R     Qr <- qr.lm(object)
#R     ...
#R     r <- z$residuals
#R     f <- z$fitted.values
#R     w <- z$weights
#R     if (is.null(w)) {
#R         mss <- if (attr(z$terms, "intercept"))
#R             sum((f - mean(f))^2)
#R         else sum(f^2)
#R         rss <- sum(r^2)
#R     }
#R     ...
#R     resvar <- rss/rdf
#R     ...
#R     R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
#R     se <- sqrt(diag(R) * resvar)
#R     ...
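To see these lines in action, here is a hedged toy example (simulated data and object names of my own, not from the question) that reproduces lm's reported standard errors from the same quantities:
set.seed(2)
d <- data.frame(x = 1:20)                            # toy data, illustration only
d$y <- 1 + 2 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)

p <- fit$rank
resvar <- sum(fit$residuals^2) / fit$df.residual     # rss/rdf: unbiased estimate of sigma^2
R <- chol2inv(fit$qr$qr[1:p, 1:p, drop = FALSE])     # (X'X)^{-1} from the stored QR factor
sqrt(diag(R) * resvar)                               # matches summary(fit)$coefficients[, "Std. Error"]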
Here R * resvar, i.e. $\hat\sigma^2 (X^\top X)^{-1}$, is the inverse observed Fisher information for $(\beta_0, \beta_1)$ conditional on the estimate of $\sigma^2$. In contrast, the inverse observed Fisher information you compute with optim is for the triplet $(\beta_0, \beta_1, \sigma)$; that is, it uses the MLE of $\sigma$ and not the unbiased estimator. The standard errors should therefore differ by a factor of $\sqrt{n/(n-3+1)}$.
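To see where this factor comes from (writing $\mathrm{RSS}$ for the residual sum of squares and $k=3$ for the number of estimated parameters $\beta_0$, $\beta_1$, $\sigma$): lm uses the unbiased estimator $\mathrm{RSS}/(n-2)$ of $\sigma^2$, while the MLE is $\mathrm{RSS}/n$, so the ratio of the two sets of standard errors is
$$
\sqrt{\frac{\mathrm{RSS}/(n-2)}{\mathrm{RSS}/n}}=\sqrt{\frac{n}{n-2}}=\sqrt{\frac{n}{n-k+1}}
$$
The small simulation below confirms this: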
set.seed(1)
n = 4 # very small sample size !
b0 <- 5
b1 <- 2
sigma <- 5
x <- runif(n, 1, 100)
y = b0 + b1*x + rnorm(n, 0, sigma)
negLL <- function(beta, y, x) {
    # negative log-likelihood of the Gaussian linear model y = b0 + b1*x + e
    b0 <- beta[1]
    b1 <- beta[2]
    sigma <- beta[3]
    yhat <- b0 + b1*x
    return(-sum(dnorm(y, yhat, sigma, log = TRUE)))
}
res <- optim(c(0, 0, 1), negLL, y = y, x = x, hessian = TRUE)
estimates <- res$par # parameter estimates
(se <- sqrt(diag(solve(res$hessian))))
#R [1] 5.690 0.097 1.653
k <- 3
se * sqrt(n / (n-k+1))
#R [1] 8.047 0.137 2.338
To elaborate further, as usεr11852 requests, the log-likelihood is
$$l(\vec{\beta},\sigma) =
-\frac{n}{2}\log(2\pi) - n\log{\sigma}
- \frac{1}{2\sigma^2}(\vec{y}-X\vec\beta)^\top(\vec{y}-X\vec\beta)$$
where $X$ is the design matrix and $n$ is the number of observations. Consequently, the observed information matrix for $\vec\beta$ is
$$-\nabla_{\vec{\beta}}\nabla_{\vec{\beta}}^\top
l(\vec{\beta},\sigma) = \frac{1}{\sigma^2}X^\top X$$
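For the intermediate step, differentiating once with respect to $\vec\beta$ gives
$$
\nabla_{\vec{\beta}}\, l(\vec{\beta},\sigma) = \frac{1}{\sigma^{2}}X^\top(\vec{y}-X\vec\beta)
$$
and differentiating once more and negating yields the matrix above.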
Now we can plug in either the MLE or the unbiased estimator of $\sigma$, as the following shows:
m <- lm(y ~ x)
X <- cbind(1, x)
sqrt(sum(resid(m)^2)/n * diag(solve(crossprod(X))))
#R                      x
#R 5.71058285 0.09732149
k <- 3
sqrt(sum(resid(m)^2)/(n-k+1) * diag(solve(crossprod(X))))
#R                   x
#R 8.0759837 0.1376334
We can do the same with a QR decomposition, as lm does:
obj <- qr(X)
sqrt(sum(resid(m)^2)/(n-k+1) * diag(chol2inv(obj$qr)))
#R [1] 8.0759837 0.1376334
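As a cross-check (using the same fit m from above), lm's own reported standard errors should coincide with the unbiased-estimator version just computed:
summary(m)$coefficients[, "Std. Error"]   # lm's reported standard errors
#R (Intercept)         x
#R   8.0759837 0.1376334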
So, to answer

> I understand from my readings on the web that optimization is not a simple task but I was wondering if it would be possible to reproduce in a simple way the standard error estimates from glm while using optim.

you need to scale up the standard errors in the Gaussian example you use.
Best Answer
No, the expectation is based on the model. We're not taking some kind of ensemble average; we're literally finding an expectation:
$$
\mathcal{I}(\theta) = - \mathrm{E} \left[\left. \frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\right|\theta \right]
$$
(though we might be finding it from a different expression that yields the same quantity).
That is, we do some algebra before we implement it in computation.
We have a single ML estimate, and we're computing the standard error from the second derivative of the likelihood at the peak -- a "sharp" peak means a small standard error, while a broad peak means a large standard error.
You might like to see that when you do this for a normal likelihood (iid observations from $N(\mu,\sigma)$, with $\sigma$ known) that this calculation yields that the Fisher information is $n/\sigma^2$, and hence that the asymptotic variance of the ML estimate of $\mu$ is $\sigma^2/n$, or its standard error is $\sigma/\sqrt{n}$. (Of course in this case, that's also the small-sample variance and standard error.)
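As a quick numerical illustration of that last point (simulated data and names of my own choosing, not from the answer): for iid $N(\mu,\sigma)$ data with $\sigma$ known, the second derivative of the log-likelihood in $\mu$ is constant, so the observed and expected information coincide at $n/\sigma^2$, and the Hessian-based standard error from optim should be close to $\sigma/\sqrt{n}$.
set.seed(3)
n <- 50
sigma <- 2                                   # treated as known
z <- rnorm(n, mean = 10, sd = sigma)         # simulated data (illustration only)

negll <- function(mu) -sum(dnorm(z, mu, sigma, log = TRUE))
fit <- optim(0, negll, method = "Brent", lower = -100, upper = 100,
             hessian = TRUE)

fit$par                 # MLE of mu; should be close to mean(z)
sqrt(1 / fit$hessian)   # SE from the numerical Hessian, approx. sqrt(sigma^2/n)
sigma / sqrt(n)         # analytic standard error sigma/sqrt(n)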