I think Questions 1 and 2 are interconnected. First, the homogeneity-of-variance assumption comes from $\boldsymbol \epsilon \ \sim \ N(\mathbf{0}, \sigma^2 \mathbf{I})$. This assumption can, however, be relaxed to more general variance structures, under which homogeneity is no longer required. In other words, it really depends on what distribution is assumed for $\boldsymbol \epsilon$.
Second, the conditional residuals are used to check the distribution of $\boldsymbol \epsilon$ (and thus any assumptions related to it), whereas the marginal residuals can be used to check the total variance structure.
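To make the distinction concrete, here is a minimal numpy sketch with hypothetical simulated data for a random-intercept model $y = X\beta + Zb + \epsilon$. For simplicity it uses the known simulation truth in place of the fitted estimates; in a real analysis $\beta$ and $b$ would come from the fitted model (estimates and BLUPs).

```python
import numpy as np

# Random-intercept model: the marginal residuals y - X beta contain b + eps
# (total variance structure Z D Z' + sigma^2 I), while the conditional
# residuals y - X beta - Z b isolate eps itself.
rng = np.random.default_rng(0)
n_groups, n_per = 10, 20
group = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(size=n_groups * n_per)
b = rng.normal(scale=2.0, size=n_groups)      # random intercepts
eps = rng.normal(scale=1.0, size=n_groups * n_per)
beta0, beta1 = 1.0, 0.5
y = beta0 + beta1 * x + b[group] + eps

marginal = y - (beta0 + beta1 * x)            # contains b + eps
conditional = marginal - b[group]             # contains eps only

# The between-group variance inflates the marginal residuals; the
# conditional residual variance should sit near sigma^2 = 1.
print(marginal.var(), conditional.var())
```

This is why the conditional residuals are the ones to inspect for assumptions about $\boldsymbol \epsilon$: the random-effect contribution has been removed from them.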
This answer is not based on my own knowledge but quotes what Bolker et al. (2009) wrote in an influential paper in the journal Trends in Ecology and Evolution. Since the article is not open access (although searching for it on Google Scholar may prove successful), I thought I would cite important passages that may help address parts of the questions. So again, this is not what I came up with myself, but I think it represents the best condensed information on GLMMs (including diagnostics) out there, written in a very straightforward and easy-to-understand style. If this answer is unsuitable for whatever reason, I will simply delete it. Things that I find useful with respect to the questions on diagnostics are highlighted in bold.
Page 127:
Researchers faced with nonnormal data often try shortcuts such as transforming data to achieve normality and homogeneity of variance, using nonparametric tests or relying on the robustness of classical ANOVA to nonnormality for balanced designs [15]. They might ignore random effects altogether (thus committing pseudoreplication) or treat them as fixed factors [16]. However, such shortcuts can fail (e.g. count data with many zero values cannot be made normal by transformation). Even when they succeed, they might violate statistical assumptions (even nonparametric tests make assumptions, e.g. of homogeneity of variance across groups) or limit the scope of inference (one cannot extrapolate estimates of fixed effects to new groups).

Instead of shoehorning their data into classical statistical frameworks, researchers should use statistical approaches that match their data. Generalized linear mixed models (GLMMs) combine the properties of two statistical frameworks that are widely used in ecology and evolution, linear mixed models (which incorporate random effects) and generalized linear models (which handle nonnormal data by using link functions and exponential family [e.g. normal, Poisson or binomial] distributions). GLMMs are the best tool for analyzing nonnormal data that involve random effects: all one has to do, in principle, is specify a distribution, link function and structure of the random effects.
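The three ingredients the passage names (distribution, link function, random-effect structure) can be illustrated with a short, hypothetical numpy simulation of count data from a Poisson GLMM with a log link and a random intercept per block. This only generates data; fitting would be done with a package such as lme4 in R.

```python
import numpy as np

# Simulated Poisson GLMM: distribution = Poisson, link = log,
# random-effect structure = one random intercept per block.
rng = np.random.default_rng(1)
n_blocks, n_per = 8, 25
block = np.repeat(np.arange(n_blocks), n_per)
x = rng.normal(size=n_blocks * n_per)         # a fixed covariate
u = rng.normal(scale=0.5, size=n_blocks)      # random block intercepts
eta = 0.2 + 0.8 * x + u[block]                # linear predictor
mu = np.exp(eta)                              # inverse of the log link
y = rng.poisson(mu)                           # Poisson counts
print(y[:10])
```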
Page 129, Box 1:
The residuals indicated overdispersion, so we refitted the data with a quasi-Poisson model. Despite the large estimated scale parameter (10.8), exploratory graphs found no evidence of outliers at the level of individuals, genotypes or populations. We used quasi-AIC (QAIC), using one degree of freedom for random effects [49], for random-effect and then for fixed-effect model selection.
Page 133, Box 4:
Here we outline a general framework for constructing a full (most complex) model, the first step in GLMM analysis. Following this process, one can then evaluate parameters and compare submodels as described in the main text and in Figure 1.
1. Specify fixed (treatments or covariates) and random effects (experimental, spatial or temporal blocks, individuals, etc.). Include only important interactions. Restrict the model a priori to a feasible level of complexity, based on rules of thumb (>5–6 random-effect levels per random effect and >10–20 samples per treatment level or experimental unit) and knowledge of adequate sample sizes gained from previous studies [64,65].

2. Choose an error distribution and link function (e.g. Poisson distribution and log link for count data, binomial distribution and logit link for proportion data).

3. Graphical checking: are variances of data (transformed by the link function) homogeneous across categories? Are responses of transformed data linear with respect to continuous predictors? Are there outlier individuals or groups? Do distributions within groups match the assumed distribution?

4. Fit fixed-effect GLMs both to the full (pooled) data set and within each level of the random factors [28,50]. Estimated parameters should be approximately normally distributed across groups (group-level parameters can have large uncertainties, especially for groups with small sample sizes). Adjust model as necessary (e.g. change link function or add covariates).

5. Fit the full GLMM.
- Insufficient computer memory or too slow: reduce model complexity. If estimation succeeds on a subset of the data, try a more efficient estimation algorithm (e.g. PQL if appropriate).
- Failure to converge (warnings or errors): reduce model complexity or change optimization settings (make sure the resulting answers make sense). Try other estimation algorithms.
- Zero variance components or singularity (warnings or errors): check that the model is properly defined and identifiable (i.e. all components can theoretically be estimated). Reduce model complexity.
Adding information to the model (additional covariates, or new groupings for random effects) can alleviate problems, as will centering continuous covariates by subtracting their mean [50]. If necessary, eliminate random effects from the full model, dropping (i) terms of less intrinsic biological interest, (ii) terms with very small estimated variances and/or large uncertainty, or (iii) interaction terms. (Convergence errors or zero variances could indicate insufficient data.)

6. Recheck assumptions for the final model (as in step 3) and check that parameter estimates and confidence intervals are reasonable (gigantic confidence intervals could indicate fitting problems). The magnitude of the standardized residuals should be independent of the fitted values. Assess overdispersion (the sum of the squared Pearson residuals should be $\chi^2$ distributed [66,67]). If necessary, change distributions or estimate a scale parameter. Check that a full model that includes dropped random effects with small standard deviations gives similar results to the final model. If different models lead to substantially different parameter estimates, consider model averaging.
Residual plots should be used to assess overdispersion, and the variances of the link-transformed data should be homogeneous across categories. Nowhere does the article state that the residuals themselves are supposed to be normally distributed.
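The overdispersion check the paper describes (sum of squared Pearson residuals, roughly $\chi^2$ with $n - p$ degrees of freedom, so the dispersion ratio should be near 1) can be sketched in a few lines of numpy. This is a hypothetical, self-contained illustration that uses the true Poisson means of simulated data in place of fitted values.

```python
import numpy as np

# Overdispersion check for a Poisson model: under a correct model the
# sum of squared Pearson residuals divided by the residual degrees of
# freedom (the dispersion) should be close to 1.
rng = np.random.default_rng(2)
n = 500
mu = np.exp(rng.normal(0.5, 0.3, size=n))     # "fitted" Poisson means

y_ok = rng.poisson(mu)                        # well-specified counts
y_over = rng.poisson(mu * rng.gamma(2.0, 0.5, size=n))  # overdispersed

def dispersion(y, mu, n_params=2):
    pearson = (y - mu) / np.sqrt(mu)          # Pearson residuals
    return np.sum(pearson ** 2) / (len(y) - n_params)

# The first ratio should be near 1, the second clearly above 1.
print(dispersion(y_ok, mu), dispersion(y_over, mu))
```

A dispersion well above 1 is the signal that, as in the paper's Box 1 example, one should switch to a quasi-Poisson (or negative binomial) model or estimate a scale parameter.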
I think the contrasting statements reflect the fact that GLMMs (pages 127–128)...
...are surprisingly challenging to use even for statisticians. Although several software packages can handle GLMMs (Table 1), few ecologists and evolutionary biologists are aware of the range of options or of the possible pitfalls. In reviewing papers in ecology and evolution since 2005 found by Google Scholar, 311 out of 537 GLMM analyses (58%) used these tools inappropriately in some way (see online supplementary material).
And here are a few fully worked examples using GLMMs, including diagnostics.
I realize that this answer reads more like a comment and should be treated as such, but the comment section doesn't allow a comment of this length. Also, since I believe this paper is valuable for this discussion (but unfortunately behind a paywall), I thought it would be useful to quote important passages here.
Cited papers:
[15] - G.P. Quinn, M.J. Keough (2002): Experimental Design and Data Analysis for Biologists, Cambridge University Press.
[16] - M.J. Crawley (2002): Statistical Computing: An Introduction to Data Analysis Using S-PLUS, John Wiley & Sons.
[28] - J.C. Pinheiro, D.M. Bates (2000): Mixed-Effects Models in S and S-PLUS, Springer.
[49] - F. Vaida, S. Blanchard (2005): Conditional Akaike information for mixed-effects models. Biometrika, 92, pp. 351–370.
[50] - A. Gelman, J. Hill (2006): Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press.
[64] - N.J. Gotelli, A.M. Ellison (2004): A Primer of Ecological Statistics, Sinauer Associates.
[65] - F.J. Harrell (2001): Regression Modeling Strategies, Springer.
[66] - J.K. Lindsey (1997): Applying Generalized Linear Models, Springer.
[67] - W. Venables, B.D. Ripley (2002): Modern Applied Statistics with S, Springer.
Best Answer
The plots, but also the name of your outcome variable, `Time.to.obtain.loan`, suggest that you have a bounded outcome. Do you perhaps have (many) zeros in `Time.to.obtain.loan`? If so, assuming a normal distribution would indeed not be optimal. You could try a Beta mixed-effects model.

However, the name of your response variable also suggests that you may need to account for censoring in `Time.to.obtain.loan`, i.e., some people have not obtained a loan yet, so their time to obtain a loan is right-censored. In this case you would need a survival-type model. For example, you could have a look at the coxme package.
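To see what right-censoring does to the analysis, here is a hand-rolled Kaplan-Meier estimator on toy data (a minimal numpy sketch; in practice one would use R's survival or coxme packages, which also handle covariates and random effects). The point is that censored applicants still contribute to the risk set up to their censoring time instead of being dropped or treated as events.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate.
    time: observed times; event: 1 = loan obtained, 0 = right-censored."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    t_events = np.unique(time[event == 1])          # distinct event times
    surv, s = [], 1.0
    for t in t_events:
        n_risk = np.sum(time >= t)                  # still waiting at t
        d = np.sum((time == t) & (event == 1))      # loans obtained at t
        s *= 1.0 - d / n_risk                       # product-limit update
        surv.append((t, s))
    return surv

# Toy data: months until loan obtained; 0 marks applicants still waiting.
# Survival steps: 0.8 after t=2, 0.6 after t=3, 0.3 after t=5.
print(kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]))
```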