Yudi Pawitan writes in his book In All Likelihood that the second derivative of the log-likelihood evaluated at the maximum likelihood estimates (MLE) is the observed Fisher information (see also this document, page 1). This is exactly what most optimization algorithms like optim in R return: the Hessian evaluated at the MLE. When the negative log-likelihood is minimized, the returned Hessian is the negative of the Hessian of the log-likelihood, i.e. the observed Fisher information itself. As you correctly point out, the estimated standard errors of the MLE are the square roots of the diagonal elements of the inverse of the observed Fisher information matrix. In other words: the square roots of the diagonal elements of the inverse of the Hessian (or the negative Hessian) are the estimated standard errors.
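To make the optim mechanics concrete, here is a minimal sketch (simulated data and object names of my own choosing, not from the question): the rate of an exponential distribution is estimated by minimizing the negative log-likelihood, and the standard error is read off the returned Hessian.
set.seed(42)
tt <- rexp(200, rate = 2)   # simulated waiting times (illustration only), true rate 2

negll <- function(rate) -sum(dexp(tt, rate = rate, log = TRUE))

fit <- optim(par = 1, fn = negll, method = "Brent", lower = 1e-6, upper = 100,
             hessian = TRUE)

fit$par                    # MLE of the rate; should be close to 1/mean(tt)
sqrt(1 / fit$hessian)      # SE from the observed information (Hessian of the neg. log-lik.)
fit$par / sqrt(length(tt)) # analytic SE for this model, rate_hat/sqrt(n), for comparison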
Summary
- The negative Hessian evaluated at the MLE is the same as the observed Fisher information matrix evaluated at the MLE.
- Regarding your main question: no, it's not correct that the observed Fisher information can be found by inverting the (negative) Hessian.
- Regarding your second question: The inverse of the (negative) Hessian is an estimator of the asymptotic covariance matrix. Hence, the square roots of the diagonal elements of the covariance matrix are estimators of the standard errors.
- I think the second document you link to got it wrong.
Formally
Let $l(\theta)$ be a log-likelihood function. The Fisher information matrix $\mathbf{I}(\theta)$ is a symmetric $(p\times p)$ matrix with entries:
$$
\left(\mathbf{I}(\theta)\right)_{ij}=-\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}l(\theta),~~~~ 1\leq i, j\leq p
$$
The observed Fisher information matrix is simply $\mathbf{I}(\hat{\theta}_{\mathrm{ML}})$, the information matrix evaluated at the maximum likelihood estimates (MLE). The Hessian is defined as:
$$
\left(\mathbf{H}(\theta)\right)_{ij}=\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}l(\theta),~~~~ 1\leq i, j\leq p
$$
It is nothing other than the matrix of second derivatives of the log-likelihood function with respect to the parameters. It follows that if you minimize the negative log-likelihood, the returned Hessian is the observed Fisher information matrix, whereas if you maximize the log-likelihood, the negative Hessian is the observed information matrix.
Further, the inverse of the observed Fisher information matrix is an estimator of the asymptotic covariance matrix:
$$
\mathrm{Var}(\hat{\theta}_{\mathrm{ML}})=[\mathbf{I}(\hat{\theta}_{\mathrm{ML}})]^{-1}
$$
The standard errors are then the square roots of the diagonal elements of the covariance matrix.
For the asymptotic distribution of a maximum likelihood estimate, we can write
$$
\hat{\theta}_{\mathrm{ML}}\stackrel{a}{\sim}\mathcal{N}\left(\theta_{0}, [\mathbf{I}(\hat{\theta}_{\mathrm{ML}})]^{-1}\right)
$$
where $\theta_{0}$ denotes the true parameter value. Hence, in the single-parameter case, the estimated standard error of the maximum likelihood estimate is given by:
$$
\mathrm{SE}(\hat{\theta}_{\mathrm{ML}})=\frac{1}{\sqrt{\mathbf{I}(\hat{\theta}_{\mathrm{ML}})}}
$$
The issue is that the standard errors come from
$$\hat\sigma^2 (X^\top X)^{-1}$$
where $\hat\sigma^2$ is the unbiased estimator of the residual variance, not the MLE. See summary.lm:
summary.lm
#R function (object, correlation = FALSE, symbolic.cor = FALSE,
#R     ...)
#R {
#R     z <- object
#R     p <- z$rank
#R     rdf <- z$df.residual
#R     ...
#R     Qr <- qr.lm(object)
#R     ...
#R     r <- z$residuals
#R     f <- z$fitted.values
#R     w <- z$weights
#R     if (is.null(w)) {
#R         mss <- if (attr(z$terms, "intercept"))
#R             sum((f - mean(f))^2)
#R         else sum(f^2)
#R         rss <- sum(r^2)
#R     }
#R     ...
#R     resvar <- rss/rdf
#R     ...
#R     R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
#R     se <- sqrt(diag(R) * resvar)
#R     ...
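To see these lines in action, here is a hedged toy example (simulated data and object names of my own, not from the question) that reproduces lm's reported standard errors from the same quantities:
set.seed(2)
d <- data.frame(x = 1:20)                            # toy data, illustration only
d$y <- 1 + 2 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)

p <- fit$rank
resvar <- sum(fit$residuals^2) / fit$df.residual     # rss/rdf: unbiased estimate of sigma^2
R <- chol2inv(fit$qr$qr[1:p, 1:p, drop = FALSE])     # (X'X)^{-1} from the stored QR factor
sqrt(diag(R) * resvar)                               # matches summary(fit)$coefficients[, "Std. Error"]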
Here R * resvar, i.e. $\hat\sigma^2 (X^\top X)^{-1}$, is the inverse observed Fisher information for $(\beta_0, \beta_1)$ conditional on the estimate of $\sigma^2$. In contrast, the inverse observed Fisher information you compute with optim is for the triplet $(\beta_0, \beta_1, \sigma)$; that is, it uses the MLE of $\sigma$ and not the unbiased estimator. The standard errors should therefore differ by a factor of $\sqrt{n/(n-3+1)}$.
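To see where this factor comes from (writing $\mathrm{RSS}$ for the residual sum of squares and $k=3$ for the number of estimated parameters $\beta_0$, $\beta_1$, $\sigma$): lm uses the unbiased estimator $\mathrm{RSS}/(n-2)$ of $\sigma^2$, while the MLE is $\mathrm{RSS}/n$, so the ratio of the two sets of standard errors is
$$
\sqrt{\frac{\mathrm{RSS}/(n-2)}{\mathrm{RSS}/n}}=\sqrt{\frac{n}{n-2}}=\sqrt{\frac{n}{n-k+1}}
$$
The small simulation below confirms this: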
set.seed(1)
n = 4 # very small sample size !
b0 <- 5
b1 <- 2
sigma <- 5
x <- runif(n, 1, 100)
y = b0 + b1*x + rnorm(n, 0, sigma)
negLL <- function(beta, y, x) {
    # negative log-likelihood of the Gaussian linear model y = b0 + b1*x + e
    b0 <- beta[1]
    b1 <- beta[2]
    sigma <- beta[3]
    yhat <- b0 + b1*x
    return(-sum(dnorm(y, yhat, sigma, log = TRUE)))
}
res <- optim(c(0, 0, 1), negLL, y = y, x = x, hessian = TRUE)
estimates <- res$par # parameter estimates
(se <- sqrt(diag(solve(res$hessian))))
#R [1] 5.690 0.097 1.653
k <- 3
se * sqrt(n / (n-k+1))
#R [1] 8.047 0.137 2.338
To elaborate further, as usεr11852 requests, the log-likelihood is
$$l(\vec{\beta},\sigma) =
-\frac{n}{2}\log(2\pi) - n\log{\sigma}
- \frac{1}{2\sigma^2}(\vec{y}-X\vec\beta)^\top(\vec{y}-X\vec\beta)$$
where $X$ is the design matrix and $n$ is the number of observations. Consequently, the observed information matrix for $\vec\beta$ is
$$-\nabla_{\vec{\beta}}\nabla_{\vec{\beta}}^\top
l(\vec{\beta},\sigma) = \frac{1}{\sigma^2}X^\top X$$
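For the intermediate step, differentiating once with respect to $\vec\beta$ gives
$$
\nabla_{\vec{\beta}}\, l(\vec{\beta},\sigma) = \frac{1}{\sigma^{2}}X^\top(\vec{y}-X\vec\beta)
$$
and differentiating once more and negating yields the matrix above.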
Now we can plug in either the MLE or the unbiased estimator of $\sigma$, as the following shows:
m <- lm(y ~ x)
X <- cbind(1, x)
sqrt(sum(resid(m)^2)/n * diag(solve(crossprod(X))))
#R                      x
#R 5.71058285 0.09732149
k <- 3
sqrt(sum(resid(m)^2)/(n-k+1) * diag(solve(crossprod(X))))
#R                   x
#R 8.0759837 0.1376334
We can do the same with a QR decomposition, as lm does:
obj <- qr(X)
sqrt(sum(resid(m)^2)/(n-k+1) * diag(chol2inv(obj$qr)))
#R [1] 8.0759837 0.1376334
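As a cross-check (using the same fit m from above), lm's own reported standard errors should coincide with the unbiased-estimator version just computed:
summary(m)$coefficients[, "Std. Error"]   # lm's reported standard errors
#R (Intercept)         x
#R   8.0759837 0.1376334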
So, to answer

> I understand from my readings on the web that optimization is not a simple task but I was wondering if it would be possible to reproduce in a simple way the standard error estimates from glm while using optim.

you need to scale up the standard errors in the Gaussian example you use.
Best Answer
No, the expectation is based on the model. We're not taking some kind of ensemble average; we're literally finding an expectation:
$$
\mathcal{I}(\theta) = - \mathrm{E} \left[\left. \frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\right|\theta \right]
$$
(though we might be finding it from a different expression that yields the same quantity).
That is, we do some algebra before we implement it in computation.
We have a single ML estimate, and we're computing the standard error from the second derivative of the likelihood at the peak -- a "sharp" peak means a small standard error, while a broad peak means a large standard error.
You might like to see that when you do this for a normal likelihood (iid observations from $N(\mu,\sigma)$, with $\sigma$ known) that this calculation yields that the Fisher information is $n/\sigma^2$, and hence that the asymptotic variance of the ML estimate of $\mu$ is $\sigma^2/n$, or its standard error is $\sigma/\sqrt{n}$. (Of course in this case, that's also the small-sample variance and standard error.)
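As a quick numerical illustration of that last point (simulated data and names of my own choosing, not from the answer): for iid $N(\mu,\sigma)$ data with $\sigma$ known, the second derivative of the log-likelihood in $\mu$ is constant, so the observed and expected information coincide at $n/\sigma^2$, and the Hessian-based standard error from optim should be close to $\sigma/\sqrt{n}$.
set.seed(3)
n <- 50
sigma <- 2                                   # treated as known
z <- rnorm(n, mean = 10, sd = sigma)         # simulated data (illustration only)

negll <- function(mu) -sum(dnorm(z, mu, sigma, log = TRUE))
fit <- optim(0, negll, method = "Brent", lower = -100, upper = 100,
             hessian = TRUE)

fit$par                 # MLE of mu; should be close to mean(z)
sqrt(1 / fit$hessian)   # SE from the numerical Hessian, approx. sqrt(sigma^2/n)
sigma / sqrt(n)         # analytic standard error sigma/sqrt(n)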