Solved – Why does glmer not achieve the maximum likelihood (as verified by applying further generic optimization)

lme4-nlme, maximum likelihood, optimization, r

Numerically deriving the MLEs of a GLMM is difficult, and I know that in practice we should not use brute-force optimization (e.g., a naive call to optim). But for my own education, I want to try it to make sure I understand the model correctly (see the code below). I found that my results are always inconsistent with those from glmer().

In particular, even if I use the MLEs from glmer as initial values, they are not MLEs according to the likelihood function I wrote (negloglik): opt0$value is smaller than opt1. I think there are two potential reasons:

negloglik is not written well, so it carries too much numerical error, and
the model specification is wrong.

For the model specification, the intended model is:

\begin{equation}
L=\prod_{i=1}^{n} \left(\int_{-\infty}^{\infty}f(y_i|N,a,b,r_{i})g(r_{i}|s)dr_{i}\right)
\end{equation}
where $f$ is a binomial pmf and $g$ is a normal pdf. I am trying to estimate $a$, $b$, and $s$. In particular, if the model specification is wrong, I want to know what the correct specification is.
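
Spelled out to match the simulation code in this question, the two factors of the integrand would read as follows. This is a sketch under assumptions: it uses the logit link from p(), writes $x_i$ for the covariate that the likelihood above leaves implicit, and introduces $\pi_i$ as notation for the success probability.

\begin{equation}
f(y_i|N,a,b,r_{i})=\binom{N}{y_i}\pi_i^{y_i}(1-\pi_i)^{N-y_i},\qquad \pi_i=\frac{\exp(a+bx_i+r_i)}{1+\exp(a+bx_i+r_i)}
\end{equation}
\begin{equation}
g(r_{i}|s)=\frac{1}{s\sqrt{2\pi}}\exp\left(-\frac{r_i^{2}}{2s^{2}}\right)
\end{equation}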

p <- function(x,a,b) exp(a+b*x)/(1+exp(a+b*x))

a <- -4  # fixed effect (intercept)
b <- 1   # fixed effect (slope)
s <- 1.5 # random effect (intercept)
N <- 8
x <- rep(2:6, each=20)
n <- length(x) 
id <- 1:n
r  <- rnorm(n, 0, s) 
y  <- rbinom(n, N, prob=p(x,a+r,b))


negloglik <- function(par, x, y, N){
  # 'par' rather than 'p', so the parameter vector does not shadow the function p()
  a <- par[1]
  b <- par[2]
  s <- par[3]

  Q <- 100  # Inf does not work well
  L_i <- function(r,x,y){
    dbinom(y, size=N, prob=p(x, a+r, b))*dnorm(r, 0, s)
  }

  -sum(log(apply(cbind(y,x), 1, function(x){ 
    integrate(L_i,lower=-Q,upper=Q,x=x[2],y=x[1],rel.tol=1e-14)$value
  })))
}

library(lme4)
(model <- glmer(cbind(y,N-y)~x+(1|id),family=binomial))

opt0 <- optim(c(fixef(model), sqrt(VarCorr(model)$id[1])), negloglik, 
                x=x, y=y, N=N, control=list(reltol=1e-50,maxit=10000)) 
opt1 <- negloglik(c(fixef(model), sqrt(VarCorr(model)$id[1])), x=x, y=y, N=N)
opt0$value  # negative loglikelihood from optim
opt1        # negative loglikelihood using glmer generated parameters
all.equal(as.numeric(-logLik(model)), opt1) # reports the difference: these are substantially different...

A simpler example

To reduce the possibility of having large numerical error, I created a simpler example.

y  <- c(0, 3)
N  <- c(8, 8)
id <- 1:length(y)

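negloglik <- function(p, y, N){
  a <- p[1]
  s <- p[2]
  Q <- 100  # Inf does not work well
  # Integrand for one cluster: binomial likelihood times the normal density of r
  L_i <- function(r, y, N){
    dbinom(y, size=N, prob=exp(a+r)/(1+exp(a+r)))*dnorm(r,0,s)
  }
  # Pair each y with its own N instead of recycling the whole N vector
  -sum(log(mapply(function(yi, Ni){
    integrate(L_i, lower=-Q, upper=Q, y=yi, N=Ni, rel.tol=1e-14)$value
  }, y, N)))
}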

library(lme4)
(model <- glmer(cbind(y,N-y)~1+(1|id), family=binomial))
MLE.glmer <- c(fixef(model), sqrt(VarCorr(model)$id[1]))
opt0 <- optim(MLE.glmer, negloglik, y=y, N=N, control=list(reltol=1e-50,maxit=10000)) 
MLE.optim <- opt0$par
MLE.glmer # MLEs from glmer
MLE.optim # MLEs from optim

L_i <- function(r,y,N,a,s) dbinom(y,size=N,prob=exp(a+r)/(1+exp(a+r)))*dnorm(r,0,s)

L1 <- integrate(L_i,lower=-100, upper=100, y=y[1], N=N[1], a=MLE.glmer[1], 
                s=MLE.glmer[2], rel.tol=1e-10)$value
L2 <- integrate(L_i, lower=-100, upper=100, y=y[2], N=N[2], a=MLE.glmer[1], 
                s=MLE.glmer[2], rel.tol=1e-10)$value

(log(L1)+log(L2)) # loglikelihood (manual computation)
logLik(model)     # loglikelihood from glmer

Best Answer

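Setting a high value of nAGQ in the glmer call made the MLEs from the two methods agree. With the default nAGQ = 1, glmer evaluates the integral over the random effect with the Laplace approximation, which was not accurate enough here; larger values use adaptive Gauss-Hermite quadrature with that many points. This settles the issue.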

glmer(cbind(y,N-y)~1+(1|id),family=binomial,nAGQ=20)

See @SteveWalker's answer to "Why can't I match glmer (family=binomial) output with manual implementation of Gauss-Newton algorithm?" for more details.
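
To get a feel for how far a Laplace approximation can sit from the integral it replaces, here is a minimal sketch for a single cluster. It is not glmer's implementation; the values of a and s and the helper names log_integrand, laplace_lik, and exact_lik are illustrative assumptions, and only the data (y = 0 and y = 3 out of N = 8) come from the simpler example in the question.

```r
# Log of the integrand for one cluster: binomial log-likelihood plus normal log-density
log_integrand <- function(r, y, N, a, s) {
  dbinom(y, size = N, prob = plogis(a + r), log = TRUE) + dnorm(r, 0, s, log = TRUE)
}

# Laplace approximation: Gaussian integral matched at the mode of the integrand
laplace_lik <- function(y, N, a, s) {
  r_hat <- optimize(log_integrand, interval = c(-20, 20), y = y, N = N, a = a, s = s,
                    maximum = TRUE, tol = 1e-10)$maximum
  p_hat <- plogis(a + r_hat)
  # Negative second derivative of the log-integrand with respect to r at the mode
  curv <- N * p_hat * (1 - p_hat) + 1 / s^2
  exp(log_integrand(r_hat, y, N, a, s)) * sqrt(2 * pi / curv)
}

# Reference value by adaptive numerical integration over a wide finite range
exact_lik <- function(y, N, a, s) {
  integrate(function(r) exp(log_integrand(r, y, N, a, s)),
            lower = -20, upper = 20, rel.tol = 1e-10)$value
}

a <- -1.5; s <- 1.2   # arbitrary illustrative values, not estimates from the post
lik <- sapply(c(0, 3), function(y) c(laplace = laplace_lik(y, 8, a, s),
                                     exact   = exact_lik(y, 8, a, s)))
colnames(lik) <- c("y=0", "y=3")
lik
```

The two rows are close but not identical, and an optimizer working on the approximate likelihood will generally settle at slightly different parameter values than one working on the integral itself.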

Related Solutions

Solved – Maximum likelihood optimization error in R

This got too long for a comment. The problem with your function is that the determinant of K_plus was overflowing to infinity or underflowing to zero very quickly. I tweaked your function to calculate the log-determinant directly. I then used optim with different methods, as well as nlminb and nlm, to search for the maximum likelihood estimates. The algorithms converged without problems. I also included the code to calculate the standard errors and confidence intervals based on the Hessian. All algorithms give very similar estimates.

The estimates are: $\widehat{\sigma}_{n}=5.30,\hat{l}=5.12,\widehat{\sigma}_{f}=45.01$.

The code is:

load("A.Rdata")
load("y.Rdata")

num_unique <- 786

Calculate_K_plus <- function(vect){
  sn2 <- (vect[1]*vect[1])
  exponent <- 1/(vect[2]*vect[2])
  sf2 <- vect[3]*vect[3]
  B <- A^exponent
  B <- sf2 * B
  B <- B + sn2*diag(num_unique)
  B
}

minus_log_likelihood <- function(vect){
  K_plus <- Calculate_K_plus(vect)
  K_plus_inv <- solve(K_plus)
  z <- determinant(K_plus, logarithm=TRUE)  # works on the log scale, so no overflow
  K_plus_log_det <- as.numeric(z$modulus) # log-determinant of K_plus (the sign is +1 for a positive definite matrix)
  out <- 0.5 * ( t(y) %*% K_plus_inv %*% y ) + 0.5 * K_plus_log_det + (num_unique/2)*log(2*pi)
  out
}

#-----------------------------------------------------------------------------
# "Nelder-Mead" algorithm
#-----------------------------------------------------------------------------

res.optim <- optim(par=c(5.3, 5.1, 44.9), fn=minus_log_likelihood, hessian=TRUE, control=list(trace=TRUE, maxit=1000))

res.optim$par    
[1]  5.302362  5.123045 45.011507

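fisher_info<- solve(res.optim$hessian)  # despite the name: inverse observed information, i.e. the estimated covariance matrix
prop_sigma<-sqrt(diag(fisher_info))     # standard errors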
upper<-res.optim$par+1.96*prop_sigma
lower<-res.optim$par-1.96*prop_sigma
interval<-data.frame(value=res.optim$par, lower=lower, upper=upper)
interval

      value     lower     upper
1  5.302362  5.032848  5.571877
2  5.123045  3.442932  6.803157
3 45.011507 17.952756 72.070257

#-----------------------------------------------------------------------------
# "L-BFGS-B" algorithm
#-----------------------------------------------------------------------------

res.optim2 <- optim(par=c(5.3, 5.1, 44.9), fn=minus_log_likelihood, method=c("L-BFGS-B"), hessian=TRUE, control=list(trace=3, maxit=1000))

res.optim2$par
[1]  5.301418  5.114984 44.901863

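fisher_info<- solve(res.optim2$hessian)  # despite the name: inverse observed information, i.e. the estimated covariance matrix
prop_sigma<-sqrt(diag(fisher_info))      # standard errors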
upper<-res.optim2$par+1.96*prop_sigma
lower<-res.optim2$par-1.96*prop_sigma
interval2<-data.frame(value=res.optim2$par, lower=lower, upper=upper)
interval2

      value     lower     upper
1  5.301418  5.031988  5.570848
2  5.114984  3.437925  6.792043
3 44.901863 17.982520 71.821206

#-----------------------------------------------------------------------------
# With "nlminb"
#-----------------------------------------------------------------------------

res.nlm <- nlminb(objective=minus_log_likelihood, start=c(5.3, 5.1, 44.9), control=list(iter.max=200, trace=1))

res.nlm$par
[1]  5.301542  5.123718 45.072189

#-----------------------------------------------------------------------------
# With "nlm"
#-----------------------------------------------------------------------------

res.nlm2 <- nlm(f=minus_log_likelihood, p=c(5.3, 5.1, 44.9), print.level=2)

res.nlm2$estimate
[1]  5.301534  5.123776 45.072711
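
The overflow that motivated the log-determinant can be reproduced without the original data. The sketch below uses a made-up positive definite matrix of the same size (786 x 786) and a made-up response vector. Neither comes from A.Rdata or y.Rdata. The Cholesky route is shown as one common alternative, not as what the answer's code does.

```r
# Hypothetical positive definite matrix, same size as K_plus in the answer
n <- 786
K <- 25 * diag(n) + 1            # eigenvalues: 25 (n - 1 times) and 25 + n

det_naive <- det(K)              # overflows to Inf in double precision
logdet    <- as.numeric(determinant(K, logarithm = TRUE)$modulus)

# A Cholesky factor gives the same log-determinant and the quadratic form
set.seed(1)
y <- rnorm(n)                    # hypothetical response vector
R <- chol(K)                     # upper triangular, K = t(R) %*% R
logdet_chol <- 2 * sum(log(diag(R)))
z    <- backsolve(R, y, transpose = TRUE)   # solves t(R) %*% z = y
quad <- sum(z^2)                 # equals t(y) %*% solve(K) %*% y

c(det_naive = det_naive, logdet = logdet, logdet_chol = logdet_chol, quad = quad)
```

Working with the log-determinant keeps the objective finite over the whole search region, which is what lets the optimizers above converge.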

Maximum Likelihood – Why Use Maximum Likelihood Instead of Expected Likelihood?

The method proposed (after normalizing the likelihood to be a density) is equivalent to putting a flat prior on all the parameters in the model and using the mean of the posterior distribution as your estimator. A flat prior can get you into trouble when it does not lead to a proper posterior distribution, and I don't know how you would rectify that situation here.

Staying in a frequentist context, though, the method doesn't make much sense: the likelihood is not a probability density in most contexts, and once the data are observed there is nothing random left to take an expectation over. We could simply define this as an operation applied to the likelihood after the fact to obtain an estimate, but I'm not sure what the frequentist properties of this estimator would look like (in the cases where the estimate actually exists).

Advantages:

This can provide an estimate in some cases where the MLE doesn't actually exist.
If you're not stubborn it can move you into a Bayesian setting (and that would probably be the natural way to do inference with this type of estimate). Ok so depending on your views this may not be an advantage - but it is to me.

Disadvantages:

This isn't guaranteed to exist either.
If we don't have a convex parameter space the estimate may not be a valid value for the parameter.
The process isn't invariant to reparameterization. Since the process is equivalent to putting a flat prior on your parameters, it matters what those parameters are (are we using $\sigma$ as the parameter, or $\sigma^2$?).
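
The lack of invariance is easy to check numerically. The sketch below assumes a toy setting that is not from the question: normal data with known mean 0 and unknown scale, simulated here. It computes the mean of the normalized likelihood once with $\sigma$ as the parameter and once with $\sigma^2$, and puts both on the $\sigma$ scale next to the MLE. The helper names are illustrative.

```r
set.seed(1)
x <- rnorm(10, mean = 0, sd = 2)   # hypothetical data; the mean is treated as known (0)
n <- length(x); S <- sum(x^2)

loglik    <- function(sigma) -n * log(sigma) - S / (2 * sigma^2)
sigma_mle <- sqrt(S / n)
# Likelihood rescaled so that its maximum is 1 (keeps integrate() well conditioned)
rel_lik   <- function(sigma) exp(loglik(sigma) - loglik(sigma_mle))

# "Expected likelihood" estimate: mean of the likelihood normalized to a density
lik_mean <- function(lik) {
  num <- integrate(function(t) t * lik(t), 0, Inf, rel.tol = 1e-8)$value
  den <- integrate(lik, 0, Inf, rel.tol = 1e-8)$value
  num / den
}

est_sigma  <- lik_mean(rel_lik)                        # parameter is sigma
est_sigma2 <- lik_mean(function(v) rel_lik(sqrt(v)))   # parameter is sigma^2

c(mle = sigma_mle, via_sigma = est_sigma, via_sigma2 = sqrt(est_sigma2))
```

The MLE of $\sigma^2$ is just the square of the MLE of $\sigma$, whereas the two averaged estimates disagree with each other once they are expressed on the same scale.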
