Solved – GLMM optimiser test – optimx.L-BFGS-B doesn’t converge, but the rest do

binomial distributionglmmlme4-nlmemultilevel-analysisr

I am running GLMM using lme4 in R for the first time. I have a complex model (with three main effects and four interactions), as well as a random intercept of participant (as my experiment used a repeated measures design).

Even with this basic random effects structure, I have been having issues with convergence. The model works with the bobyqa optimiser, but I am concerned it may be changing the data and giving me unreliable estimates.

As suggested by the authors of lme4 (on "?converge"), I have checked all possible optimisers. This is what it says re. convergence:

try all available optimizers (e.g. several different implementations of BOBYQA and Nelder-Mead, L-BFGS-B from optim, nlminb, …) via the allFit() function, see ‘5.’ in the examples. While this will of course be slow for large fits, we consider it the gold standard; if all optimizers converge to values that are practically equivalent, then we would consider the convergence warnings to be false positives.

These are the results I am getting:

bobyqa TRUE
Nelder_Mead TRUE
nlminbw TRUE
optimx.L-BFGS-B FALSE
nloptwrap.NLOPT_LN_NELDERMEAD TRUE
nloptwrap.NLOPT_LN_BOBYQA TRUE

So all but one of the models are fitting. The estimates from these are very similar to each other. See the example below.

Is this okay or is this problematic? If this is okay, should I report the results with the bobyqa optimiser? If this is problematic, what analysis would you recommend? ANOVAs are sub-optimal as the data is binary (correct/incorrect), and logistic regression is sub-optimal as the design is repeated measures.

Thank you for your help. Hopefully I've explained the problem in enough detail, but I'm happy to provide more information if necessary.

Thanks.

Best Answer

Don't worry about the single non coverging optimizer: there is no free lunch and no optimizer that works best in all situations. This is why the lme4 team has built-in support for several different optimizers (and you can even use your own custom optimizers).

All of your estimates are fairly close in numeric terms, but just how close they are will depend on your particular dataset and domain.

In terms of which optimizer to use, I would tend to prefer the fit with the best log-likelihood (i.e. biggest -- for negative values, this means closest to zero) because for identical models but with different approximations (here: optimizers), the best log-likelihood indicates the best approximation.

My guess is that bobyqa won't have the best log-likelihood because it is closer to the non-converged than the converged estimates, at least for the first two predictors.

One final thing: the optimizers don't "change" the data. They simply try to iteratively solve the problem you give them (via your model specification). The different optimizers are just different approaches for how to handle each iteration, and the convergence tests largely work by checking how the approximate solution changes from iteration to iteration. In addition to the differing performance between optimizers, which can lead to some truly performing better or worse in a given situation, these convergence tests are themselves approximate and not perfect, so they can also sometimes give false positives. A lot of work has gone into tuning these tests so that they give false negatives hopefully very rarely and false positives not all too often.

Related Solutions

Solved – Help interpreting count data GLMM using lme4 glmer and glmer.nb – Negative binomial versus Poisson

I believe there are some important problems to be addressed with your estimation.

From what I gathered by examining your data, your units are not geographically grouped, i.e. census tracts within counties. Thus, using tracts as a grouping factor is not appropriate to capture spatial heterogeneity as this means that you have the same number of individuals as groups (or to put another way, all your groups have only one observation each). Using a multilevel modelling strategy allows us to estimate individual-level variance, while controlling for between-group variance. Since your groups have only one individual each, your between-group variance is the same as your individual-level variance, thus defeating the purpose of the multilevel approach.

On the other hand, the grouping factor can represent repeated measurements over time. For example, in the case of a longitudinal study, an individual's "maths" scores may be recored yearly, thus we would have a yearly value for each student for n years (in this case, the grouping factor is the student as we have n number of observations "nested" within students). In your case, you have repeated measures of each census tract by decade. Thus, you could use your TRTID10 variable as a grouping factor to capture "between decade variance". This leads to 3142 observations nested in 635 tracts, with approximately 4 and 5 observations per census tract.

As mentioned in a comment, using decade as a grouping factor is not very appropriate, as you have only around 5 decades for each census tract, and their effect can be better captured introducing decade as a covariate.

Second, to determine whether your data ought to be modelled using a poisson or negative binomial model (or a zero inflated approach). Consider the amount of overdispersion in your data. The fundamental characteristic of a Poisson distribution is equidispersion, meaning that the mean is equal to the variance of the distribution. Looking at your data, it is pretty clear that there is much overdispersion. Variances are much much greater than the means.

library(dplyr)    
 dispersionstats <- scaled.mydata %>%
 + group_by(decade) %>%
 + summarise(
 + means = mean(R_VAC),
 + variances = var(R_VAC),
 + ratio = variances/means)

##   dispersionstats
##   # A tibble: 5 x 5
##   decade     means variances     ratio 
##    <int>     <dbl>     <dbl>     <dbl> 
## 1   1970  45.43513   4110.89  90.47822 
## 2   1980 103.52365  17323.34 167.33707 
## 3   1990 177.68038  62129.65 349.67087 
## 4   2000 190.23150  91059.60 478.67784 
## 5   2010 247.68246 126265.60 509.78821

Nonetheless, to determine if the negative binomial is more appropriate statistically, a standard method is to do a likelihood ratio test between a Poisson and a negative binomial model, which suggests that the negbin is a better fit.

library(MASS)
library(lmtest)

modelformula <- formula(R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln))

poismodel <- glm(modelformula, data = scaled.mydata, family = "poisson")   
nbmodel <- glm.nb(modelformula, data = scaled.mydata)

lrtest(poismodel, nbmodel)

## Likelihood ratio test

##  Model 1: R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln)  
## Model 2: R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln)
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   8 -154269
## 2   9  -17452  1 273634  < 2.2e-16 ***
##  ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

After establishing this, a next test could consider whether the multilevel (mixed model) approach is warranted using similar approach, which suggests that the multilevel version provides a better fit. (A similar test could be used to compare a glmer fit assuming a poisson distribution to the glmer.nb object, as long as the models are otherwise the same.)

library(lme4)

glmmformula <- update(modelformula, . ~ . + (1|TRTID10))

nbglmm <- glmer.nb(glmmformula, data = scaled.mydata)

lrtest(nbmodel, nbglmm)

## Model 1: R_VAC ~ factor(decade) + P_NONWHT * a_hinc + offset(HU_ln)
## Model 2: R_VAC ~ factor(decade) + P_NONWHT + a_hinc + (1 | TRTID10) +
##     P_NONWHT:a_hinc + offset(HU_ln)
##   #Df LogLik Df Chisq Pr(>Chisq)
## 1   9 -17452
## 2  10 -17332  1 239.3  < 2.2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Regarding the estimates of the poisson and nb models they are actually expected to be very similar to each other, with the main distinction being the standard errors, ie if overdispersion is present, the poisson model tends to provide biased standard errors. Taking your data as an example:

poissonglmm <- glmer(glmmformula, data = scaled.mydata)
summary(poissonglmm)

## Random effects:
##  Groups  Name        Variance Std.Dev.
## TRTID10 (Intercept) 0.2001   0.4473
## Number of obs: 3142, groups:  TRTID10, 635

## Fixed effects:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -2.876013   0.020602 -139.60   <2e-16 ***
## factor(decade)1980  0.092597   0.007602   12.18   <2e-16 ***
## factor(decade)1990  0.903543   0.007045  128.26   <2e-16 ***
## factor(decade)2000  0.854821   0.006913  123.65   <2e-16 ***
## factor(decade)2010  0.986126   0.006723  146.67   <2e-16 ***
## P_NONWHT           -0.125500   0.014007   -8.96   <2e-16 ***
## a_hinc             -0.107335   0.001480  -72.52   <2e-16 ***
## P_NONWHT:a_hinc     0.160937   0.003117   51.64   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

summary(nbglmm)
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  TRTID10 (Intercept) 0.09073  0.3012
## Number of obs: 3142, groups:  TRTID10, 635

## Fixed effects:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -2.797861   0.056214  -49.77  < 2e-16 ***
## factor(decade)1980  0.118588   0.039589    3.00  0.00274 **
## factor(decade)1990  0.903440   0.038255   23.62  < 2e-16 ***
## factor(decade)2000  0.843949   0.038172   22.11  < 2e-16 ***
## factor(decade)2010  1.068025   0.037376   28.58  < 2e-16 ***
## P_NONWHT            0.020012   0.089224    0.22  0.82253
## a_hinc             -0.129094   0.008109  -15.92  < 2e-16 ***
## P_NONWHT:a_hinc     0.149223   0.018967    7.87 3.61e-15 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Notice how the coefficient estimates are all very similar, the main difference being only the significance of one of your covariates, as well as the difference in the random effects variance, which suggests that the unit-level variance captured by the overdispersion parameter in the nb model (the theta value in the glmer.nb object) captures some of the between tract variance captured by the random effects.

Regarding exponentiated coefficients (and associated confidence intervals), you can use the following:

fixed <- fixef(nbglmm)
confnitfixed <- confint(nbglmm, parm = "beta_", method = "Wald") # Beware: The Wald method is less accurate but much, much faster.

# The exponentiated coefficients are also known as Incidence Rate Ratios (IRR)
IRR <- exp(cbind(fixed, confintfixed)
IRR
##                         fixed      2.5 %     97.5 %
## (Intercept)        0.06094028 0.05458271 0.06803835
## factor(decade)1980 1.12590641 1.04184825 1.21674652
## factor(decade)1990 2.46807856 2.28979339 2.66024515
## factor(decade)2000 2.32553168 2.15789585 2.50619029
## factor(decade)2010 2.90962703 2.70410073 3.13077444
## P_NONWHT           1.02021383 0.85653208 1.21517487
## a_hinc             0.87889172 0.86503341 0.89297205
## P_NONWHT:a_hinc    1.16093170 1.11856742 1.20490048

Final thoughts, regarding zero inflation. There is no multilevel implementation (at least that I am aware of) of a zero inflated poisson or negbin model that allows you to specify an equation for zero inflated component of the mixture. the glmmADMB model lets you estimate a constant zero inflation parameter. An alternative approach would be to use the zeroinfl function in the pscl package, though this does not support multilevel models. Thus, you could compare the fit of a single level negative binomial, to the single level zero inflated negative binomial. Chances are that if zero inflation is not significant for single level models, it is likely that it would not be significant for the multilevel specification.

Addendum

If you are concerned about spatial autocorrelation, you could control for this using some form of geographical weighted regression (though I believe this uses point data, not areas). Alternatively, you could group your census tracts by an additional grouping factor (states, counties) and include this as a random effect. Lastly, and I am not sure if this is entirely feasible, it may be possible to incorporate spatial dependence using, for example, the average count of R_VAC in first order neighbours as a covariate. In any case, prior to such approaches, it would be sensible to determine if spatial autocorrelation is indeed present (using Global Moran's I, LISA tests, and similar approaches).

Best Answer

Related Solutions

Solved – Help interpreting count data GLMM using lme4 glmer and glmer.nb – Negative binomial versus Poisson

Addendum

Related Question