Solved – How to assess the fit of a binomial GLMM fitted with lme4 (> 1.0)

binomial-distribution, glmm, lme4-nlme, residuals

I have a GLMM with a binomial distribution and a logit link function, and I have the feeling that an important aspect of the data is not well represented in the model.

To test this, I would like to know whether the data are well described by a linear function on the logit scale. Hence, I would like to check whether the residuals are well behaved. However, I cannot figure out which residual plot to use or how to interpret it.

Note that I am using the new version of lme4 (the development version from GitHub):

packageVersion("lme4")
## [1] ‘1.1.0’

My question is: How do I inspect and interpret the residuals of a binomial generalized linear mixed model with a logit link function?

The following data represents only 17% of my real data, but fitting already takes around 30 seconds on my machine, so I leave it like this:

require(lme4)
options(contrasts=c('contr.sum', 'contr.poly'))

dat <- read.table("http://pastebin.com/raw.php?i=vRy66Bif")
dat$V1 <- factor(dat$V1)

m1 <- glmer(true ~ distance*(consequent+direction+dist)^2 + (direction+dist|V1), dat, family = binomial)

The simplest plot (?plot.merMod) produces the following:

plot(m1)

[residual-vs-fitted plot produced by plot(m1)]

Does this already tell me something?

Best Answer

Short answer, since I don't have time for better: this is a challenging problem; binary data almost always requires some kind of binning or smoothing to assess goodness of fit. It was somewhat helpful to use fortify.lmerMod (from lme4; experimental) in conjunction with ggplot2, and particularly geom_smooth(), to draw essentially the same residual-vs-fitted plot you have above, but with confidence intervals (I also narrowed the y limits a bit to zoom in on the (-5,5) region). That suggested some systematic variation that could be improved by tweaking the link function. (I also tried plotting residuals against the other predictors, but that wasn't very useful.)
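A minimal sketch of that plot, assuming the experimental method is available as lme4::fortify.merMod() and that it adds .fitted and .resid columns to the data (column names are an assumption):

library(ggplot2)
ff <- lme4::fortify.merMod(m1)          # experimental; assumed to add .fitted / .resid
ggplot(ff, aes(.fitted, .resid)) +
    geom_point(alpha = 0.3) +
    geom_smooth() +                     # smoothed residual trend with a confidence band
    coord_cartesian(ylim = c(-5, 5))    # zoom in on the (-5, 5) region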

I tried fitting the model with all 3-way interactions, but it wasn't much of an improvement either in deviance or in the shape of the smoothed residual curve.
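For reference, a sketch of one way to do that comparison; the fixed-effects formula below, (distance + consequent + direction + dist)^3, is only one plausible reading of "all 3-way interactions":

## refit with all three-way interactions among the fixed effects (assumed formula)
m2 <- update(m1, . ~ (distance + consequent + direction + dist)^3 +
                     (direction + dist | V1))
anova(m1, m2)   # likelihood-ratio (deviance) comparison of the nested fits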

Then I used this bit of brute force to try inverse-link functions of the form $(\mbox{logistic}(x))^\lambda$, for $\lambda$ ranging from 0.5 to 2.0:

## uses (fragile) internal C calls for speed; could use plogis(),
##  qlogis() for readability and stability instead
logitpower <- function(lambda) {
    L <- list(linkfun=function(mu)
              .Call(stats:::C_logit_link,mu^(1/lambda),PACKAGE="stats"),
              linkinv=function(eta)
              .Call(stats:::C_logit_linkinv,eta,PACKAGE="stats")^lambda,
              mu.eta=function(eta) {
                  mu <-  .Call(stats:::C_logit_linkinv,eta,PACKAGE="stats")
                  mu.eta <-  .Call(stats:::C_logit_mu_eta,eta,PACKAGE="stats")
                  lambda*mu^(lambda-1)*mu.eta
              },
              valideta = function(eta) TRUE ,
              name=paste0("logit-power(",lambda,")"))
    class(L) <- "link-glm"
    L
}
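As the comment in the function says, the same link can be built from the documented functions plogis(), qlogis(), and dlogis(); a more readable (if slower) equivalent sketch, called logitpower2 here:

## equivalent "link-glm" object using documented R functions instead of internal C calls
logitpower2 <- function(lambda) {
    L <- list(linkfun  = function(mu) qlogis(mu^(1/lambda)),   # eta = logit(mu^(1/lambda))
              linkinv  = function(eta) plogis(eta)^lambda,     # mu = logistic(eta)^lambda
              mu.eta   = function(eta)                         # dmu/deta, by the chain rule
                  lambda*plogis(eta)^(lambda-1)*dlogis(eta),
              valideta = function(eta) TRUE,
              name     = paste0("logit-power(", lambda, ")"))
    class(L) <- "link-glm"
    L
}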

I found that a $\lambda$ of 0.75 fit slightly better than the original logit-link model, although not significantly so; I may have been overinterpreting the data.
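A sketch of the brute-force comparison itself, assuming update() re-runs glmer() with the new family and that binomial(link = ...) accepts the custom "link-glm" object (each refit takes about as long as the original model):

lambdavec <- seq(0.5, 2, by = 0.25)
fits <- lapply(lambdavec, function(lam)
    update(m1, family = binomial(link = logitpower(lam))))
## smaller AIC is better; compare against AIC(m1) for the plain logit fit
cbind(lambda = lambdavec, AIC = sapply(fits, AIC))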

See also: http://freakonometrics.hypotheses.org/8210