Solved – Default lme4 optimizer requires lots of iterations for high-dimensional data

lme4-nlme, mixed model, numerics, optimization, r

TL;DR: with the default optimizer, the number of iterations lme4 needs appears to grow linearly with the number of model parameters, and the fit is way slower than an equivalent glm with dummy variables for the groups. Is there anything I can do to speed it up?


I'm trying to fit a fairly large hierarchical logit model (~50k rows, 100 columns, 50 groups). Fitting an ordinary logit model to the data (with dummy variables for the groups) works fine, but the hierarchical model appears to get stuck: the first optimization phase completes without trouble, but the second churns through a large number of iterations without anything changing and without stopping.

EDIT: I suspect the problem is mainly that I have so many parameters, because when I try to set maxfun to a lower value I get a warning:

Warning message:
In commonArgs(par, fn, control, environment()) :
  maxfun < 10 * length(par)^2 is not recommended.

However, the parameter estimates aren't changing at all over the course of the optimization, so I'm still confused about what to do. When I tried setting maxfun in the optimizer controls anyway (despite the warning), the fit seemed to hang after the optimization finished.
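
For concreteness, the cap can be passed through glmerControl's optCtrl list; a minimal sketch (the maxfun value here is just illustrative):

library(lme4)
# cap the Nelder-Mead evaluation budget via optCtrl
ctrl <- glmerControl(optimizer = "Nelder_Mead",
                     optCtrl = list(maxfun = 50000))
# test.model <- glmer(test.formula, data = test.case, family = "binomial",
#                     control = ctrl, verbose = TRUE)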

Here's some code that reproduces the problem for random data:

library(lme4)

set.seed(1)

SIZE <- 50000
NGRP <- 50
NCOL <- 100

# Simulate data: skewed group sizes, a mostly-1 outcome, and NCOL uniform covariates
test.case <- data.frame(i=1:SIZE)
test.case[["grouping"]] <- sample(NGRP, size=SIZE, replace=TRUE, prob=1/(1:NGRP))
test.case[["y"]] <- sample(c(0, 1), size=SIZE, replace=TRUE, prob=c(0.05, 0.95))

# Start with a random intercept per group, then add the covariates one by one
test.formula <- y ~ (1 | grouping)

for (i in 1:NCOL) {
    colname <- paste("col", i, sep="")
    test.case[[colname]] <- runif(SIZE)
    test.formula <- update.formula(test.formula, as.formula(paste(". ~ . +", colname)))
}

print(test.formula)

test.model <- glmer(test.formula, data=test.case, family='binomial', verbose=TRUE)

This outputs:

start par. =  1 fn =  19900.78 
At return
eval:  15 fn:      19769.402 par:  0.00000
(NM) 20: f = 19769.4 at           0     <other numbers>
(NM) 40: f = 19769.4 at           0     <other numbers>

I tried setting NCOL to other values, and the number of iterations appears to be roughly 40 per column. Obviously this becomes a huge pain as I add more columns. Are there tweaks I can make to the optimization algorithm that will reduce the dependence on the number of columns?
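
For reference, the plain-logit comparison mentioned at the top can be reproduced along these lines; a sketch that uses lme4's nobars() to strip the random-effects term and factor(grouping) for the dummy variables:

library(lme4)  # for nobars()
fixed.formula <- update.formula(nobars(test.formula), . ~ . + factor(grouping))
glm.model <- glm(fixed.formula, data = test.case, family = binomial)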

Best Answer

One thing you could try is to change the optimizer. See Ben Bolker's comment at this github issue. The nlopt implementation of bobyqa is usually much faster than the default (at least whenever I try it).

library(nloptr)
## default options for the NLopt BOBYQA run
defaultControl <- list(algorithm="NLOPT_LN_BOBYQA", xtol_rel=1e-6, maxeval=1e5)
## wrapper with the calling convention glmerControl() expects for a user-supplied optimizer
nloptwrap2 <- function(fn, par, lower, upper, control=list(), ...) {
    for (n in names(defaultControl))
      if (is.null(control[[n]])) control[[n]] <- defaultControl[[n]]
    res <- nloptr(x0=par, eval_f=fn, lb=lower, ub=upper, opts=control, ...)
    with(res, list(par=solution,
                   fval=objective,
                   feval=iterations,
                   conv=if (status > 0) 0 else status,  # nloptr status > 0 means success
                   message=message))
}

system.time(test.model <- glmer(test.formula, data=test.case,
                                family='binomial', verbose=TRUE))

system.time(test.model2 <- update(test.model,
                                  control=glmerControl(optimizer="nloptwrap2")))
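
A quick sanity check (a sketch; the tolerance is arbitrary) that the faster optimizer lands on essentially the same fit:

logLik(test.model)
logLik(test.model2)
all.equal(fixef(test.model), fixef(test.model2), tolerance = 1e-4)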

Also, see this answer for more options and this thread from R-sig-mixed-models (which looks more relevant to your issue).
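
For completeness, a sketch of other built-in choices you can swap in the same way (glmerControl also accepts a length-2 vector giving one optimizer per optimization phase):

# minqa's BOBYQA for both phases
ctl_bobyqa <- glmerControl(optimizer = "bobyqa")
# different optimizers for the two phases
ctl_mixed  <- glmerControl(optimizer = c("bobyqa", "Nelder_Mead"))
# test.model3 <- update(test.model, control = ctl_bobyqa)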

Edit: the nloptr advice above is out of date. In lme4 1.1-7 and later, an nloptr-based wrapper ships with the package (see ?nloptwrap), so all you have to do is add

control = [g]lmerControl(optimizer = "nloptwrap") # +g if fitting with glmer

to your call.
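
For the example above, that reduces to something like this (a sketch):

test.model3 <- glmer(test.formula, data = test.case, family = "binomial",
                     control = glmerControl(optimizer = "nloptwrap"),
                     verbose = TRUE)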
