Solved – Unexpected residuals plot of mixed linear model using lmer (lme4 package) in R

Tags: mixed-model, r, residuals

I have conducted an experiment with multiple (categorical) conditions per subject and multiple measurements per subject.

My data frame in short: each subject has one property, is_frisian, which is either 0 or 1 depending on the subject, and each subject is tested under two categorical factors, person and condition. The measurement variable is error, which is either 0 or 1.
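For illustration, a few hypothetical rows of such a data frame might look like this (made-up values, only the column names come from the description above):

output <- data.frame(
  subject_id = factor(c(1, 1, 2, 2, 3, 3)),
  is_frisian = c(1, 1, 0, 0, 1, 1),                  # constant within subject
  condition  = factor(c("A", "B", "A", "B", "A", "B")),
  person     = factor(c("first", "first", "second", "second", "first", "second")),
  error      = c(0, 1, 0, 0, 1, 0)                   # binary outcome
)
str(output)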

My mixed linear model in R is:

> model <- lmer(error~is_frisian*condition*person+(1|subject_id), data=output)

However, the residuals plot of this model gives an unexpected (?) result.

[Figure: residuals vs. fitted values of the lmer model]

I was taught that this plot should show randomly scattered points and that the residuals should be normally distributed. When I plot the densities of the fitted values and of the residuals, they look reasonably normal. But how are the lines visible in the graph to be explained, and are they okay?

The only explanation I could come up with is that the graph shows two lines because of the categorical variables: the outcome variable error is either 0 or 1. But I do not know enough about the underlying machinery to confirm this. Also, the lines seem to have a slight negative slope; is that perhaps a problem?

UPDATE:

> library(arm)   # binnedplot() comes from the arm package
> model <- glmer(error~is_frisian*condition*person + (1|subject_id), data=output, family='binomial')
> binnedplot(fitted(model), resid(model))

Gives the following result:

[Figure: binned residual plot of the glmer model]

FINAL EDIT:

The density plots have been omitted; they have nothing to do with checking the assumptions in this case. For a list of the assumptions of logistic regression (when using family=binomial), see here at statisticssolutions.com.

Best Answer

Your residual structure is entirely expected with this model specification and is an indication of an ill-specified model. What you are essentially trying to do is fit a straight line through points that can only take the values 0 and 1 on the $y$-axis.

Let's look at a simple example with arbitrarily generated variables:

#-----------------------------------------------------------------------------
# Generate random data for logistic regression
#-----------------------------------------------------------------------------

set.seed(123)
x  <- rnorm(1000)            # continuous predictor
z  <- 1 + 2*x                # linear predictor
pr <- 1/(1+exp(-z))          # inverse logit: true success probabilities
y  <- rbinom(1000, 1, pr)    # binary outcome

#-----------------------------------------------------------------------------
# Plot the data
#-----------------------------------------------------------------------------

par(bg="white", cex=1.2)
plot(y~x, las=1, ylim=c(-0.1, 1.3))

#-----------------------------------------------------------------------------
# Fit a linear regression (nonsensical) and plot the fit
#-----------------------------------------------------------------------------

linear.mod <- lm(y~x)

# The hard-coded break points are the x-values where the fitted line
# crosses y = 0 (x = -2.32146) and y = 1 (x = 1.24196)
segments(-2.32146, 0, 1.24196, 1, col="steelblue", lwd=2)   # fit within [0, 1]
segments(1.24196, 1, 100, 28.71447, col="red", lwd=2)       # predictions above 1
segments(-100, -27.41153, -2.32146, 0, col="red", lwd=2)    # predictions below 0
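As an aside, those break points need not be hard-coded: they follow directly from the fitted coefficients. A short sketch that reproduces the same segments:

b  <- coef(linear.mod)                 # intercept and slope of the linear fit
x0 <- (0 - b[[1]])/b[[2]]              # x where the line crosses y = 0
x1 <- (1 - b[[1]])/b[[2]]              # x where the line crosses y = 1
segments(x0, 0, x1, 1, col="steelblue", lwd=2)
segments(x1, 1, 100, b[[1]] + b[[2]]*100, col="red", lwd=2)
segments(-100, b[[1]] - b[[2]]*100, x0, 0, col="red", lwd=2)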

[Figure: scatterplot of y against x with the fitted line; red segments mark predictions outside [0, 1]]

As you can see, a straight line is fitted through the data. One problem with this is that the line predicts outcomes outside the interval $[0,1]$ (illustrated by the red segments). Let's have a look at the residuals:

#-----------------------------------------------------------------------------
# Add the residual lines
#-----------------------------------------------------------------------------

x.y0 <- sample(which(y==0), 50, replace=FALSE)   # random subset of observations with y = 0
x.y1 <- sample(which(y==1), 50, replace=FALSE)   # random subset with y = 1

pre <- predict(linear.mod)

segments(x[x.y0], y[x.y0], x[x.y0], pre[x.y0], col="red", lwd=2)
points(x[x.y0], y[x.y0], pch=16, col="red", las=1)

segments(x[x.y1], y[x.y1], x[x.y1], pre[x.y1], col="blue", lwd=2)
points(x[x.y1], y[x.y1], pch=16, col="blue", las=1)

[Figure: the scatterplot with residual segments drawn for the sampled points]

I randomly picked some observations to show the pattern. The red and blue segments depict the residuals, i.e. the differences between the observed values (red and blue dots) and the values predicted by the line. The blue segments correspond to residuals where $y=1$, the red ones to residuals where $y=0$. Because the outcome can only be 0 or 1, the residuals are simply the vertical distances between the regression line and either 0 or 1. They take exactly the form that you see in your data:

#-----------------------------------------------------------------------------
# Plot the residuals
#-----------------------------------------------------------------------------

res.linear <- residuals(linear.mod, type="response")

par(bg="white", cex=1.2)
plot(predict(linear.mod)[y==0], res.linear[y==0], las=1,
     xlab="Fitted values", ylab = "Residuals",
     ylim = max(abs(res.linear))*c(-1,1), xlim=c(-0.4, 1.6), col="red")
points(predict(linear.mod)[y==1], res.linear[y==1], col="blue")
abline(h = 0, lty = 2)

[Figure: residuals vs. fitted values of the linear model, showing two parallel descending lines]

The colors correspond to the residuals shown above: the blue dots are the residuals where $y=1$ and the red dots those where $y=0$. In ordinary linear regression the residuals are assumed to be approximately normally distributed, but here they can hardly be normal: they are binomial.
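This also explains the slight negative slope you noticed: since the residual is $y-\hat{y}$, it equals $-\hat{y}$ when $y=0$ and $1-\hat{y}$ when $y=1$, so both bands are exact straight lines with slope $-1$ against the fitted values. A quick check overlays these theoretical lines on the plot above:

# Overlay the theoretical bands: r = -fitted (y = 0) and r = 1 - fitted (y = 1)
abline(a = 0, b = -1, col = "red",  lty = 3)
abline(a = 1, b = -1, col = "blue", lty = 3)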

We need a transformation that maps the probability, which is bounded to $[0,1]$, onto a variable that ranges over $(-\infty, \infty)$. One such transformation is the logit (this is not the only possibility: we could also use the probit or the complementary log-log function). Let's fit a logistic regression with a logit link and again plot the binned residuals, as explained on page 97 of Gelman and Hill (2007); plotting the raw residuals against the fitted values is generally not useful after logistic regression.
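As a quick aside, base R already provides the logit and its inverse as qlogis() and plogis(); a minimal sketch:

p <- seq(0.01, 0.99, by=0.01)
logit.p <- qlogis(p)                      # log(p/(1 - p)): maps (0, 1) to (-Inf, Inf)
stopifnot(all.equal(plogis(logit.p), p))  # plogis() inverts the logit
plot(p, logit.p, type="l", las=1, xlab="p", ylab="logit(p)")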

#-----------------------------------------------------------------------------
# Fit a logistic regression
#-----------------------------------------------------------------------------

glm.fit <- glm(y~x, family=binomial(link="logit"))

#-----------------------------------------------------------------------------
# Plot the binned residuals as recommended by Gelman and Hill (2007)
#-----------------------------------------------------------------------------

library(arm)
par(bg="white", cex=1.2, las=1)
binnedplot(predict(glm.fit), resid(glm.fit), cex.pts=1, col.int="black")

[Figure: binned residual plot of the logistic regression]

The residuals in logistic regression can be defined, as in linear regression, as observed minus expected values: $$ \text{residual}_{i}=y_{i}-\mathrm{E}(y_{i}\mid X_{i})=y_{i}-\text{logit}^{-1}(X_{i}\beta) $$ Because the data $y_{i}$ are discrete, so are the residuals. In the plot above, the residuals are binned by dividing the data into categories (bins) based on their fitted values, and then the average residual is plotted against the average fitted value for each bin. The lines indicate $\pm 2$ standard-error bounds, within which we would expect about 95% of the binned residuals to fall if the model were true.
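To make the binning concrete, here is a rough sketch of the computation (the actual binnedplot() implementation in arm differs in details such as how the bins are chosen):

f <- fitted(glm.fit)                        # fitted probabilities
r <- resid(glm.fit, type="response")        # raw residuals y - fitted
bins <- cut(f, breaks=quantile(f, seq(0, 1, length.out=41)),
            include.lowest=TRUE)            # ~40 bins of roughly equal size
mean.f <- tapply(f, bins, mean)             # average fitted value per bin
mean.r <- tapply(r, bins, mean)             # average residual per bin
se2    <- 2*tapply(r, bins, sd)/sqrt(table(bins))   # +/- 2 SE bounds per bin
plot(mean.f, mean.r, las=1, xlab="Average fitted value",
     ylab="Average residual", ylim=range(-se2, se2, mean.r))
lines(mean.f,  se2, col="gray")
lines(mean.f, -se2, col="gray")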

So the remedy for your immediate problem is to fit a mixed effects logistic regression by typing:

model <- glmer(error ~ is_frisian*condition*person + (1|subject_id),
               data=output, family="binomial")
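For a self-contained illustration of the whole workflow, here is a sketch with simulated data (hypothetical subjects and a single predictor, not your experiment):

library(lme4)
library(arm)

set.seed(123)
n.subj <- 50; n.obs <- 20
subject_id <- factor(rep(1:n.subj, each=n.obs))
u  <- rnorm(n.subj, sd=1)                       # random intercept per subject
x  <- rnorm(n.subj*n.obs)                       # one predictor for simplicity
pr <- plogis(1 + 2*x + u[as.integer(subject_id)])
error <- rbinom(n.subj*n.obs, 1, pr)
sim <- data.frame(error, x, subject_id)

fit <- glmer(error ~ x + (1 | subject_id), data=sim, family="binomial")
binnedplot(fitted(fit), resid(fit, type="response"))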

For a good introduction to mixed effects logistic regression in R, see here. For a good overview of diagnostics in linear and generalized linear models, see here.
