Solved – Casewise diagnostics and testing assumptions for a mixed effect logistic regression in R

Tags: assumptions, cooks-distance, lme4-nlme, logistic, regression

I am modelling a binary outcome (Buried) with two predictors: Offset (a three-level factor) and Width_mm (a continuous predictor). In addition, multiple data points came from the same unit, a chamber (27 chambers in total), so I have included random intercepts that vary by chamber to account for the non-independence of data points from the same chamber.

I have run my model using the glmer function in the lme4 package in R. My model is:

    ball3 <- glmer(Buried ~ Offset + Width_mm + (1 | Chamber), family = binomial, data = ballData)

I am now checking the assumptions of the model. I can get the fitted values and the residuals for this model using the following:

    fitted(ball3)
    resid(ball3)

When I try to get other casewise statistics using the following commands:

    standardised.resid <- rstandard(ball3)
    studentised.resid <- rstudent(ball3)
    dfbeta <- dfbeta(ball3)
    dffit <- dffits(ball3)
    leverage <- hatvalues(ball3)

I get the following error message (using rstandard as an example):

    Error in UseMethod("rstandard") :
      no applicable method for 'rstandard' applied to an object of class "c('glmerMod', 'merMod')"

Likewise, when I try to get the Variance Inflation Factor (VIF) to check for multicollinearity, I get the following error (top line is the command, bottom is error):

    vif(ball3)

    Error in UseMethod("determinant") :
      no applicable method for 'determinant' applied to an object of class "c('dpoMatrix', 'dsyMatrix', 'ddenseMatrix', 'symmetricMatrix', 'dMatrix', 'denseMatrix', 'compMatrix', 'Matrix', 'xMatrix', 'mMatrix')"
    In addition: Warning message:
    In vif.default(ball3) : No intercept: vifs may not be sensible.

My questions are:

  1. Is there a way to get casewise diagnostics (such as standardised residuals, Cook's distance, etc.) for my model (a mixed-effects logistic regression fitted using glmer)?

  2. How can I calculate the vif statistic for this model?

  3. Does my model have the same assumptions of homoscedasticity and normality of the residuals as a simple linear regression? If so, how would I test this, given that a plot of residuals vs. fitted values, and a Q-Q plot of the residuals, would look a bit odd because of the binary nature of the outcome?

  4. Are there any other important assumptions that I should check for my model?

My grasp of statistics is rather basic, so any help is appreciated.

Thanks, Ben, for pointing me to the influence.ME package. I have pasted in the code I used to estimate the dfbeta values and Cook's distance because 1) it would be great if someone could check what I have done and correct me if I am wrong, and 2) it might be useful to someone else with a similar model who wants to check for influential points.

To calculate the parameter estimates without each data point, along with several other values needed to then calculate influence statistics:

    influence <- influence(ball3, obs = TRUE)

Setting obs = TRUE means that single observations, rather than groups, are deleted from the model. Then use the influence object created above to estimate Cook's distance:

    cooks <- cooks.distance(influence, sort = TRUE)

The sort=TRUE sorts the Cook's distance values, making it easy to find large values.
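To make the large values easier to spot visually, influence.ME also provides a plot method for the influence object. The 4/n cutoff below is a common rule of thumb, not something from the package documentation or the original post:

```r
# Plot Cook's distances for all observations, sorted, with a
# rule-of-thumb cutoff of 4/n marked (the cutoff is a convention,
# not a hard rule)
n <- nrow(ballData)
plot(influence, which = "cook",
     cutoff = 4 / n, sort = TRUE,
     xlab = "Cook's distance", ylab = "Observation")
```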
Then find the dfbeta values:

    dfbetas(influence, sort = TRUE, to.sort = "Width_mm", abs = FALSE)

Here I have sorted the dfbeta values on the parameter “Width_mm”. The dfbeta values for the other parameters in the model are also given, but the data frame is sorted on the “Width_mm” values.
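Analogously to Cook's distance, a common rule of thumb (again a convention, not part of the original question) flags observations whose absolute dfbetas exceed 2/sqrt(n) for any parameter. A quick way to check that here would be:

```r
# Flag observations whose dfbetas exceed the common 2/sqrt(n)
# cutoff for at least one fixed-effect parameter
db <- dfbetas(influence)               # matrix: one column per fixed effect
cutoff <- 2 / sqrt(nrow(ballData))
which(apply(abs(db), 1, max) > cutoff) # row indices of flagged observations
```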

One thing to note, however: when I ran the first line of code, I got the following warning:

    Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
      Model failed to converge with max|grad| = 0.00122035 (tol = 0.001, component 1)

However, Cook's distance and dfbeta values were returned for all observations in the data set, so I am not too sure what has happened here.
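Regarding that convergence warning: the gradient only just exceeds the tolerance, so one thing worth trying (a suggestion, not something verified on your data) is to refit with a different optimizer and compare the estimates; if they barely change, the warning is likely benign:

```r
library(lme4)

# Refit with the bobyqa optimizer and a larger iteration budget
ball3b <- update(ball3,
                 control = glmerControl(optimizer = "bobyqa",
                                        optCtrl = list(maxfun = 1e5)))

# Compare the fixed-effect estimates from the two fits
cbind(original = fixef(ball3), bobyqa = fixef(ball3b))
```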

Best Answer

  • Check out the influence.ME package. It's designed for quantifying group-level rather than observation-level influence, but it works for the latter (specify obs=TRUE in influence()); it can also be pretty slow because it re-estimates the model once per case (e.g., 15 seconds on my laptop to compute observation-level influences for a relatively small LMM fitted to the 144-row Penicillin data set).
  • For VIF (which I'm not wild about), I'm not sure: this question has been asked and so far not answered. On the other hand, a slightly more general version was asked and answered here (pointing to this Github repository).
  • In general, diagnostics on binary models are difficult: the conceptual problem here is not specific to mixed models but applies to binary GLMs in general. See this question and this question.
  • I would consider checking out this set of examples for GLMM diagnostics in R.
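For the VIF question specifically, one approach along the lines of the widely circulated vif.mer() helper is to compute VIFs from the covariance matrix of the fixed-effect estimates with the intercept dropped. Treat this as a sketch rather than an official lme4 function (the function name and the intercept-dropping convention are assumptions carried over from that helper):

```r
# Sketch: VIFs for the fixed effects of a merMod fit, computed
# from the correlation matrix of the fixed-effect estimates
# (intercept removed). Mirrors the commonly shared vif.mer()
# helper; not part of lme4 itself.
vif.mer <- function(fit) {
  v <- as.matrix(vcov(fit))            # covariance of fixed-effect estimates
  keep <- names(lme4::fixef(fit)) != "(Intercept)"
  r <- cov2cor(v[keep, keep, drop = FALSE])
  diag(solve(r))                       # VIF for each remaining fixed effect
}

vif.mer(ball3)
```

Note that for a multi-level factor such as Offset this returns one VIF per dummy variable, not a single generalized VIF for the factor as a whole.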