Optimization – Residual Standard Error Difference Between Optim and GLM in R

maximum likelihoodoptimizationr

I try to reproduce with optim the results from a simple linear regression fitted with glm or even nls R functions.
The parameters estimates are the same but the residual variance estimate and the standard errors of the other parameters are not the same particularly when the sample size is low. I suppose that this is due differences in the way the residual standard error is calculated between Maximum Likelihood and Least square approaches (dividing by n or by n-k+1 see bellow in the example).
I understand from my readings on the web that optimization is not a simple task but I was wondering if it would be possible to reproduce in a simple way the standard error estimates from glm while using optim.

Simulate a small dataset

set.seed(1)
n = 4 # very small sample size !
b0 <- 5
b1 <- 2
sigma <- 5
x <- runif(n, 1, 100)
y =  b0 + b1*x + rnorm(n, 0, sigma)

Estimate with optim

negLL <- function(beta, y, x) {
    b0 <- beta[1]
    b1 <- beta[2]
    sigma <- beta[3]
    yhat <- b0 + b1*x
    likelihood <- dnorm(y, yhat, sigma)
    return(-sum(log(likelihood)))
}

res <- optim(starting.values, negLL, y = y, x = x, hessian=TRUE)
estimates <- res$par     # Parameters estimates
se <- sqrt(diag(solve(res$hessian))) # Standard errors of the estimates
cbind(estimates,se)


    > cbind(estimates,se)
      estimates         se
b0     9.016513 5.70999880
b1     1.931119 0.09731153
sigma  4.717216 1.66753138

Comparison with glm and nls

> m <- glm(y ~ x)
> summary(m)$coefficients
            Estimate Std. Error   t value    Pr(>|t|)
(Intercept) 9.016113  8.0759837  1.116411 0.380380963
x           1.931130  0.1376334 14.030973 0.005041162
> sqrt(summary(m)$dispersion) # residuals standard error
[1] 6.671833
> 
> summary(nls( y ~ b0 + b1*x, start=list(b0 = 5, b1= 2)))

Formula: y ~ b0 + b1 * x

Parameters:
   Estimate Std. Error t value Pr(>|t|)   
b0   9.0161     8.0760   1.116  0.38038   
b1   1.9311     0.1376  14.031  0.00504 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.672 on 2 degrees of freedom

I can reproduce the different residual standard error estimates like this :

> # optim / Maximum Likelihood estimate
> sqrt(sum(resid(m)^2)/n)
[1] 4.717698
> 
> # Least squares estimate (glm and nls estimates)
> k <- 3 # number of parameters
> sqrt(sum(resid(m)^2)/(n-k+1))
[1] 6.671833

Best Answer

The issues is that the standard errors comes from

$$\hat\sigma^2 (X^\top X)^{-1}$$

where $\hat\sigma^2$ is the unbiased estimator and not the MLE. See summary.lm

summary.lm
#R function (object, correlation = FALSE, symbolic.cor = FALSE, 
#R     ...) 
#R {
#R    z <- object
#R    p <- z$rank
#R    rdf <- z$df.residual
#R    ...
#R    Qr <- qr.lm(object) 
#R    ... 
#R    r <- z$residuals
#R    f <- z$fitted.values
#R    w <- z$weights
#R    if (is.null(w)) {
#R         mss <- if (attr(z$terms, "intercept")) 
#R             sum((f - mean(f))^2)
#R         else sum(f^2)
#R         rss <- sum(r^2)
#R    }
#R    ...
#R    resvar <- rss/rdf
#R    ...
#R    R <- chol2inv(Qr$qr[p1, p1, drop = FALSE])
#R    se <- sqrt(diag(R) * resvar)
#R    ...

This is the inverse observed Fisher information for $(\beta_0, \beta_1)$ conditional on $\hat\sigma^2$. Now the inverse observed Fisher information you compute is for the triplet $(\beta_0, \beta_1, \sigma)$. I.e., you use the MLE of $\sigma$ and not the unbiased estimator. Thus, I gather the standard errors should differ by factor $\sqrt{n/(n-3 + 1)}$ or something similar. This is the case

set.seed(1)
n = 4 # very small sample size !
b0 <- 5
b1 <- 2
sigma <- 5
x <- runif(n, 1, 100)
y =  b0 + b1*x + rnorm(n, 0, sigma) 

negLL <- function(beta, y, x) {
  b0 <- beta[1]
  b1 <- beta[2]
  sigma <- beta[3]
  yhat <- b0 + b1*x
  return(-sum(dnorm(y, yhat, sigma, log = TRUE)))
}

res <- optim(c(0, 0, 1), negLL, y = y, x = x, hessian=TRUE)
estimates <- res$par     # Parameters estimates
(se <- sqrt(diag(solve(res$hessian))))
#R [1] 5.690 0.097 1.653
k <- 3
se * sqrt(n / (n-k+1))
#R [1] 8.047 0.137 2.338

To elaborate more as usεr11852 requests, the log-likelihood is

$$l(\vec{\beta},\sigma) = -\frac{n}{2}\log(2\pi) - n\log{\sigma} - \frac{1}{2\sigma^2}(\vec{y}-X\vec\beta)^\top(\vec{y}-X\vec\beta)$$

where $X$ is the design matrix and $n$ is the number of observation. Consequently, the observed information matrix is

$$-\nabla_{\vec{\beta}}\nabla_{\vec{\beta}}^\top l(\vec{\beta},\sigma) = \frac{1}{\sigma^2}X^\top X$$

Now we can either plug in the MLE or the unbaised estimator of $\sigma$ as the following show

m <- lm(y ~ x)
X <- cbind(1, x)
sqrt(sum(resid(m)^2)/n       * diag(solve(crossprod(X))))
#R                     x 
#R 5.71058285 0.09732149
k <- 3
sqrt(sum(resid(m)^2)/(n-k+1) * diag(solve(crossprod(X))))
#R                   x 
#R 8.0759837 0.1376334

We can do the same with a QR decomposition as lm does

obj <- qr(X)
sqrt(sum(resid(m)^2)/(n-k+1) * diag(chol2inv(obj$qr)))
#R [1] 8.0759837 0.1376334

So to answer

I understand from my readings on the web that optimization is not a simple task but I was wondering if it would be possible to reproduce in a simple way the standard error estimates from glm while using optim.

then you need to scale up the standard errors in the Gaussian example you use.

Related Solutions

Solved – Estimation with MLE and returning the score/gradient (QMLE)

The numDeriv package can indeed be used to compute the gradient and the hessian (if needed). In both cases the argument y of the log-likelihood is passed through the dots mechanism, using an argument with the suitable name. For a vector-valued function, the jacobian function of the same package can be used similarly.

You could also consider computing analytical derivatives rather than numerical ones.

library(numDeriv)
H <- hessian(func = loglik, x = b$par, y = y)
g <- grad(func = loglik, x = b$par, y = y)

We can compute as well the Jacobian of a function returning a vector of length $T-1$.

mlogDens <- function(theta, y) {
  T <- length(y)
  -dnorm(y[2:T], mean = theta[1] + theta[2] * y[1:(T-1)], 
                 sd = theta[3], log = TRUE)
}
## a matrix with dim (T-1, 3)
G <- jacobian(func = mlogDens, x = b$par, y = y)
## a matrix with dim (9, T-1)
GG <- apply(G, MARGIN = 1, FUN = tcrossprod)
## a vector with length 9 representing a symmetric mat.
GGsum0 <- apply(GG, MARGIN = 1, FUN = sum)
## a symmetric mat.
GGsum <- matrix(GGsum0, nrow = 3, ncol = 3)

Solved – Why calculating standard error of an mle (and confidence intervals) from Hessian matrices

I assume to get the Expected Hessian matrix I need to run my maximum likelihood program multiple iterations to get multiple hessian matrices

No, the expectation is based on the model. We're not getting some kind of ensemble-average, we're literally finding an expectation:

$\mathcal{I}(\theta) = - \text{E} \left[\left. \frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\right|\theta \right]\,.$

(though we might be finding it from a different expression that yields the same quantity).

That is, we do some algebra before we implement it in computation.

We have a single ML estimate, and we're computing the standard error from the second derivative of the likelihood at the peak -- a "sharp" peak means a small standard error, while a broad peak means a large standard error.

You might like to see that when you do this for a normal likelihood (iid observations from $N(\mu,\sigma)$, with $\sigma$ known) that this calculation yields that the Fisher information is $n/\sigma^2$, and hence that the asymptotic variance of the ML estimate of $\mu$ is $\sigma^2/n$, or its standard error is $\sigma/\sqrt{n}$. (Of course in this case, that's also the small-sample variance and standard error.)

Best Answer

Related Solutions

Solved – Estimation with MLE and returning the score/gradient (QMLE)

Solved – Why calculating standard error of an mle (and confidence intervals) from Hessian matrices

Related Question