Multilevel Analysis – Why Does This 3-Level Multilevel Logistic Regression Fail to Converge?

lme4-nlme · mixed model · multilevel-analysis

I'm running a multilevel logistic regression in which I estimate the probability that the authors of scientific papers will make a particular sort of error in reporting a null hypothesis significance test (specifically, an error whereby the degrees of freedom, test statistic, and $p$-value are inconsistent with one another).

The outcome is a binary variable, error, indicating whether the statistic was reported correctly. The predictors are:

  • the type of statistic ($t$, $F$, $r$, or $\chi^2$),
  • the year in which the paper was published, and
  • a binary categorisation indicating whether the NHST was focal to the paper's aims or instead peripheral (e.g. a manipulation check, or a test of some assumption).

The data has a nested structure whereby each significance test is nested in a paper, and each paper is nested in a journal.

If I ignore the article level (so that tests are nested directly in journals), the 2-level model converges and I get results that look rather sensible.

library(lme4)

dat <- read.csv("error_data.csv")

m1 <- glmer(error ~ 1 + year + categorisation + statistic + (1 | journalID),
            data = dat,
            family = binomial(link = "logit"))

summary(m1)

I've been advised that the nesting of tests within articles (and of articles within journals) should be taken into account, and this makes sense to me. However, when I run the 3-level model with that nesting included, I get a convergence warning.

m2 <- glmer(error ~ 1 + year + categorisation + statistic + (1 | journalID/articleID),
            data = dat,
            family = binomial(link = "logit"))

summary(m2)

I get the message

Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 4.25558 (tol = 0.002, component 1)

While calling summary(m2) gives the somewhat reassuring message "optimizer (Nelder_Mead) convergence code: 0 (OK)", there would still seem to be a problem here, since the parameter estimates have changed wildly between m1 and m2 and, to me, they no longer make much sense.
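(For what it's worth, here is a minimal sketch of how to inspect the optimizer's own return code, the stored convergence messages, and the scaled gradient that the warning refers to, following the recipe in lme4's ?convergence help page; it assumes m2 has been fitted as above.)

m2@optinfo$conv$opt                ## the optimizer's own return code (0 = OK)
m2@optinfo$conv$lme4$messages      ## lme4's post-fit convergence messages
relgrad <- with(m2@optinfo$derivs, solve(Hessian, gradient))
max(abs(relgrad))                  ## scaled gradient used in the convergence check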

Why is this happening, and what should I do? So that anyone reading this can reproduce the issue, I have uploaded the CSV file here. My guess is it relates to the fact that most articles have only 1 or 2 data points sampled. I'm reluctant to adopt some alternate way forward (e.g. a Bayesian solution) both because it would make things more complicated and because I already preregistered the current approach.

Update: with the help of some links @KrisBae provided, I've tried a few things, so far to no avail. I'll list what I've tried and the messages that resulted, in case they point in the direction of a solution.

  • Rescaling and/or centering the predictors doesn't fix the problem.
  • I checked, and neither of the random intercepts has near-zero variance.
  • The fit of m2 isn't singular or near-singular.
  • The gradient calculations come out somewhat differently using the numDeriv package, but I'm not sure what, if anything, this means.
  • Restarting the fit from its final parameter values did not fix the problem (see the sketch after this list).
  • Increasing the number of iterations just produced the error "PIRLS step-halvings failed to reduce deviance in pwrssUpdate".
  • Changing the optimizer to bobyqa for both phases resulted in another convergence failure and the message "Model is nearly unidentifiable: very large eigenvalue".
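For reference, here is a rough sketch of how the restart, extra-iterations, and optimizer-switch attempts above can be specified (the control values, e.g. maxfun = 2e5, are arbitrary choices of mine rather than recommendations):

## restart from the parameter values where m2 stopped
ss <- getME(m2, c("theta", "fixef"))
m2_restart <- update(m2, start = ss)

## allow the optimizer many more function evaluations
m2_more <- update(m2, control = glmerControl(optCtrl = list(maxfun = 2e5)))

## use bobyqa for both the nAGQ = 0 phase and the final phase
m2_bobyqa <- update(m2, control = glmerControl(optimizer = "bobyqa",
                                               optCtrl = list(maxfun = 2e5)))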

Setting nAGQ=0 does make the model converge – according to the documentation, nAGQ is "the number of points per axis for evaluating the adaptive Gauss-Hermite approximation to the log-likelihood". This isn't recommended by any of the documents; it was just an idea that occurred to me. From other posts on this site I see that setting nAGQ=0 is not generally encouraged, and there doesn't yet seem to be an answer to the question of when it is OK. So in addition to wanting to know why my model doesn't converge, I'm also curious why setting nAGQ to 0 does make it converge.
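(In case it helps anyone reproduce this, the nAGQ=0 refit was just an update of m2 using the standard glmer argument.)

m2_nAGQ0 <- update(m2, nAGQ = 0)
summary(m2_nAGQ0)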

As an experiment I tried randomly sampling 1500 of my 2086 cases and running the model with nAGQ = 1. Over 30 repetitions I got 22 singular fits (where the parameter estimates always seemed implausible), 5 "Model failed to converge" warnings (where the estimates sometimes seemed sensible and sometimes not), and 3 fits that converged without warnings (where the estimates always seemed reasonably sensible, though they varied a bit). Broadly similar results ensued when I instead sampled 1000 of the 2086 cases.
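Here is a minimal sketch of that subsampling experiment (the seed and the way I classify the outcomes are my own choices):

set.seed(101)
res <- replicate(30, {
  d <- dat[sample(nrow(dat), 1500), ]
  fit <- suppressWarnings(
    glmer(error ~ 1 + year + categorisation + statistic +
            (1 | journalID/articleID),
          data = d, family = binomial(link = "logit"), nAGQ = 1)
  )
  msgs <- unlist(fit@optinfo$conv$lme4$messages)
  if (isSingular(fit)) {
    "singular"
  } else if (any(grepl("failed to converge", msgs))) {
    "failed to converge"
  } else {
    "converged"
  }
})
table(res)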

Best Answer

I'm going to answer this with two pictures and a lot of code.

The underlying difficulty with your model is that you have a very uneven distribution of samples, with very few observations in the early years.

Points (with areas proportional to sample size; some are invisible because the samples are very small) show proportions by statistic/year/categorisation with exact binomial CIs; the blue curves are loess fits to the overall data set (not split by statistic). In a lot of the early years there are very few observations, all of which are zero; this is going to lead to problems very similar to complete separation. Centering the year helps with this a little bit.
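One quick way to see this sparsity is a simple cross-tabulation of errors by publication year (my own sketch, not part of the plotting code below):

with(dat, table(year, error))   ## counts of non-errors (0) and errors (1) per year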

[Figure: plot of means by category]

Results of plotting the allFit output: the differences in the fixed-effect estimates are probably negligible relative to the scale of the confidence intervals. The intercept is of very large magnitude, as is the standard deviation of the articleID:journalID random effect (their CIs are probably also very large; I didn't compute profile CIs). The journalID SD is effectively zero (a singular model).

Overall, I would use the results from nloptr BOBYQA (the fit with the best negative log-likelihood). I am definitely concerned about the validity/accuracy of the Laplace approximation: about half of the journalID groups have fewer than 5 observations, and all of the journalID:articleID groups do. (A usual rule of thumb for the adequacy of the Laplace approximation for binary data is that the minimum of the number of zeros and the number of ones per group should be greater than 5-10.) I tried to run GLMMadaptive::mixed_model() (because glmer can't do nAGQ > 1 with multiple random effects) but hit an error; it might be possible to get around this by defining the journalID:articleID interaction explicitly and dropping empty levels. (Update: I think the error arises because the grid on which we would need to evaluate the quadrature is impossibly large, 1800^n.)
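A quick check of the cluster sizes that the rule of thumb refers to (my own sketch, using the dat object constructed in the code below):

journal_n <- table(dat$journalID)                                           ## tests per journal
article_n <- table(interaction(dat$journalID, dat$articleID, drop = TRUE))  ## tests per article
mean(journal_n < 5)   ## proportion of journals with fewer than 5 observations
mean(article_n < 5)   ## proportion of articles with fewer than 5 observations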

[Figure: allFit results]

library(lme4)
library(GLMMadaptive)
library(ggplot2); theme_set(theme_bw())
library(colorspace)
library(tidyverse)

dat <- read.csv("CV543932.csv")
## small adjustments; in particular, scaling/centering year helps
##  with stability
dat <- transform(dat,
                 error = as.numeric(error),
                 categorisation = factor(categorisation),
                 year = drop(scale(year)))

m2 <- glmer(error ~ 1 + year + categorisation + statistic +
              (1 | journalID/articleID),
            data = dat,
            family = binomial(link = "logit"))

summary(m2)

aa <- allFit(m2, parallel = "multicore", ncpus = 6)

try(
m3 <- mixed_model(error ~ 1 + year + categorisation + statistic,
                  random = ~1 | journalID/articleID,
                  data = dat,
                  family = binomial(link = "logit"))
)
## Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
##   invalid 'times' value

save(aa, m2, file = "CV54932.rda")

This is using the development version of broom.mixed (remotes::install_github("bbolker/broom.mixed")), where I just added tidiers for allFit objects.

library(broom.mixed)  ## development version (see above); provides tidy()/glance() for allFit objects

tt <- suppressWarnings(
    ## tidy() only has models that succeeded at some level
    left_join(tidy(aa, conf.int = TRUE), glance(aa), by = "optimizer")
    ## keep only 'adequate' models
    %>% filter(NLL_rel < 0.3)
    ## adjustments for plotting
    %>% mutate(across(term, ~ifelse(!is.na(group),
                                    paste(term,group, sep = "."),
                                    term)))
    %>% select(optimizer:estimate, conf.low, conf.high, NLL_rel)
    %>% mutate(across(optimizer, ~fct_reorder(., NLL_rel)))
    %>% mutate(across(term, fct_inorder))
)

ggplot(tt,
       aes(y=optimizer, x=estimate, xmin=conf.low, xmax=conf.high, colour = NLL_rel)) +
  geom_point() +
  geom_linerange() +
  facet_wrap(~term, scales = "free")

prop_cl_binom <- function(x, ...)  {
  bb <- binom.test(sum(x), length(x))
  data.frame(y = mean(x), ymin = bb$conf.int[1], ymax = bb$conf.int[2])
}

prop_size_binom <- function(x, ...)  {
  data.frame(y = mean(x), size = sum(x))
}

pd <- position_dodge(width=0.5)
ggplot(dat, aes(year, error, colour = statistic, shape = categorisation)) +
  facet_wrap(~categorisation) +
  stat_summary(fun.data = prop_cl_binom,
               position = pd,
               geom = "linerange") +
  stat_summary(fun.data = prop_size_binom,
               position = pd,
               alpha = 0.5,
               geom = "point",
               show.legend = c(size = TRUE)) +
  ## scale_size_area() +
  scale_colour_discrete_qualitative() +
  ## scale_size_area(trans="log10") +
  scale_y_continuous(limits=c(0,NA), oob = scales::squish) +
  geom_smooth(method="loess",
              ## method.args = list(family=binomial),
              aes(group = 1),
              formula = y~x)


ggsave("CV543932_2.png", width=10, height=6)