Solved – Unexpected residuals plot of mixed linear model using lmer (lme4 package) in R

Tags: mixed-model, r, residuals

I have conducted an experiment with multiple (categorical) conditions per subject and multiple measurements per subject.

My data frame in short: each subject has one property, is_frisian, which is either 0 or 1 depending on the subject, and each subject is tested under two categorical factors, person and condition. The measurement variable is error, which is either 0 or 1.
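For illustration, a few hypothetical rows of such a data frame might look like this (made-up values, only the column names come from the description above):

output <- data.frame(
  subject_id = factor(c(1, 1, 2, 2, 3, 3)),
  is_frisian = c(1, 1, 0, 0, 1, 1),                  # constant within subject
  condition  = factor(c("A", "B", "A", "B", "A", "B")),
  person     = factor(c("first", "first", "second", "second", "first", "second")),
  error      = c(0, 1, 0, 0, 1, 0)                   # binary outcome
)
str(output)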

My mixed linear model in R is:

> model <- lmer(error~is_frisian*condition*person+(1|subject_id), data=output)

However, the residuals plot of this model gives an unexpected (?) result.

[Figure: residuals vs. fitted values of the lmer model]

I was taught that this plot should show randomly scattered points and that the residuals should be normally distributed. When I plot the densities of the fitted values and of the residuals, they look reasonably normal. But how are the lines visible in the graph to be explained, and are they okay?

The only explanation I could come up with is that the graph shows two lines because of the categorical variables: the outcome variable error is either 0 or 1. But I do not know enough about the underlying machinery to confirm this. Also, the lines seem to have a slight negative slope; is that perhaps a problem?

UPDATE:

> library(arm)   # binnedplot() comes from the arm package
> model <- glmer(error~is_frisian*condition*person + (1|subject_id), data=output, family='binomial')
> binnedplot(fitted(model), resid(model))

Gives the following result:

[Figure: binned residual plot of the glmer model]

FINAL EDIT:

The density plots have been omitted; they have nothing to do with checking the assumptions in this case. For a list of the assumptions of logistic regression (when using family=binomial), see here at statisticssolutions.com.

Best Answer

Your residual structure is entirely expected with this model specification and is an indication of an ill-specified model. What you are essentially trying to do is fit a straight line through points that can only take the values 0 and 1 on the $y$-axis.

Let's look at a simple example with arbitrarily generated variables:

#-----------------------------------------------------------------------------
# Generate random data for logistic regression
#-----------------------------------------------------------------------------

set.seed(123)
x  <- rnorm(1000)            # continuous predictor
z  <- 1 + 2*x                # linear predictor
pr <- 1/(1+exp(-z))          # inverse logit: true success probabilities
y  <- rbinom(1000, 1, pr)    # binary outcome

#-----------------------------------------------------------------------------
# Plot the data
#-----------------------------------------------------------------------------

par(bg="white", cex=1.2)
plot(y~x, las=1, ylim=c(-0.1, 1.3))

#-----------------------------------------------------------------------------
# Fit a linear regression (nonsensical) and plot the fit
#-----------------------------------------------------------------------------

linear.mod <- lm(y~x)

# The hard-coded break points are the x-values where the fitted line
# crosses y = 0 (x = -2.32146) and y = 1 (x = 1.24196)
segments(-2.32146, 0, 1.24196, 1, col="steelblue", lwd=2)   # fit within [0, 1]
segments(1.24196, 1, 100, 28.71447, col="red", lwd=2)       # predictions above 1
segments(-100, -27.41153, -2.32146, 0, col="red", lwd=2)    # predictions below 0
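As an aside, those break points need not be hard-coded: they follow directly from the fitted coefficients. A short sketch that reproduces the same segments:

b  <- coef(linear.mod)                 # intercept and slope of the linear fit
x0 <- (0 - b[[1]])/b[[2]]              # x where the line crosses y = 0
x1 <- (1 - b[[1]])/b[[2]]              # x where the line crosses y = 1
segments(x0, 0, x1, 1, col="steelblue", lwd=2)
segments(x1, 1, 100, b[[1]] + b[[2]]*100, col="red", lwd=2)
segments(-100, b[[1]] - b[[2]]*100, x0, 0, col="red", lwd=2)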

[Figure: scatterplot of y against x with the fitted line; red segments mark predictions outside [0, 1]]

As you can see, a straight line is fitted through the data. One problem with this is that the line predicts outcomes outside the interval $[0,1]$ (illustrated by the red segments). Let's have a look at the residuals:

#-----------------------------------------------------------------------------
# Add the residual lines
#-----------------------------------------------------------------------------

x.y0 <- sample(which(y==0), 50, replace=FALSE)   # random subset of observations with y = 0
x.y1 <- sample(which(y==1), 50, replace=FALSE)   # random subset with y = 1

pre <- predict(linear.mod)

segments(x[x.y0], y[x.y0], x[x.y0], pre[x.y0], col="red", lwd=2)
points(x[x.y0], y[x.y0], pch=16, col="red", las=1)

segments(x[x.y1], y[x.y1], x[x.y1], pre[x.y1], col="blue", lwd=2)
points(x[x.y1], y[x.y1], pch=16, col="blue", las=1)

[Figure: the scatterplot with residual segments drawn for the sampled points]

I randomly picked some observations to show the pattern. The red and blue segments depict the residuals, i.e. the differences between the observed values (red and blue dots) and the values predicted by the line. The blue segments correspond to residuals where $y=1$, the red ones to residuals where $y=0$. Because the outcome can only be 0 or 1, the residuals are simply the vertical distances between the regression line and either 0 or 1. They take exactly the form that you see in your data:

#-----------------------------------------------------------------------------
# Plot the residuals
#-----------------------------------------------------------------------------

res.linear <- residuals(linear.mod, type="response")

par(bg="white", cex=1.2)
plot(predict(linear.mod)[y==0], res.linear[y==0], las=1,
     xlab="Fitted values", ylab = "Residuals",
     ylim = max(abs(res.linear))*c(-1,1), xlim=c(-0.4, 1.6), col="red")
points(predict(linear.mod)[y==1], res.linear[y==1], col="blue")
abline(h = 0, lty = 2)

[Figure: residuals vs. fitted values of the linear model, showing two parallel descending lines]

The colors correspond to the residuals shown above: the blue dots are the residuals where $y=1$ and the red dots those where $y=0$. In ordinary linear regression the residuals are assumed to be approximately normally distributed, but here they can hardly be normal: they are binomial.
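This also explains the slight negative slope you noticed: since the residual is $y-\hat{y}$, it equals $-\hat{y}$ when $y=0$ and $1-\hat{y}$ when $y=1$, so both bands are exact straight lines with slope $-1$ against the fitted values. A quick check overlays these theoretical lines on the plot above:

# Overlay the theoretical bands: r = -fitted (y = 0) and r = 1 - fitted (y = 1)
abline(a = 0, b = -1, col = "red",  lty = 3)
abline(a = 1, b = -1, col = "blue", lty = 3)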

We need a transformation that maps the probability, which is bounded to $[0,1]$, onto a variable that ranges over $(-\infty, \infty)$. One such transformation is the logit (this is not the only possibility: we could also use the probit or the complementary log-log function). Let's fit a logistic regression with a logit link and again plot the binned residuals, as explained on page 97 of Gelman and Hill (2007); plotting the raw residuals against the fitted values is generally not useful after logistic regression.
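As a quick aside, base R already provides the logit and its inverse as qlogis() and plogis(); a minimal sketch:

p <- seq(0.01, 0.99, by=0.01)
logit.p <- qlogis(p)                      # log(p/(1 - p)): maps (0, 1) to (-Inf, Inf)
stopifnot(all.equal(plogis(logit.p), p))  # plogis() inverts the logit
plot(p, logit.p, type="l", las=1, xlab="p", ylab="logit(p)")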

#-----------------------------------------------------------------------------
# Fit a logistic regression
#-----------------------------------------------------------------------------

glm.fit <- glm(y~x, family=binomial(link="logit"))

#-----------------------------------------------------------------------------
# Plot the binned residuals as recommended by Gelman and Hill (2007)
#-----------------------------------------------------------------------------

library(arm)
par(bg="white", cex=1.2, las=1)
binnedplot(predict(glm.fit), resid(glm.fit), cex.pts=1, col.int="black")

[Figure: binned residual plot of the logistic regression]

The residuals in logistic regression can be defined, as in linear regression, as observed minus expected values: $$ \text{residual}_{i}=y_{i}-\mathrm{E}(y_{i}\mid X_{i})=y_{i}-\text{logit}^{-1}(X_{i}\beta) $$ Because the data $y_{i}$ are discrete, so are the residuals. In the plot above, the residuals are binned by dividing the data into categories (bins) based on their fitted values, and then the average residual is plotted against the average fitted value for each bin. The lines indicate $\pm 2$ standard-error bounds, within which we would expect about 95% of the binned residuals to fall if the model were true.
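To make the binning concrete, here is a rough sketch of the computation (the actual binnedplot() implementation in arm differs in details such as how the bins are chosen):

f <- fitted(glm.fit)                        # fitted probabilities
r <- resid(glm.fit, type="response")        # raw residuals y - fitted
bins <- cut(f, breaks=quantile(f, seq(0, 1, length.out=41)),
            include.lowest=TRUE)            # ~40 bins of roughly equal size
mean.f <- tapply(f, bins, mean)             # average fitted value per bin
mean.r <- tapply(r, bins, mean)             # average residual per bin
se2    <- 2*tapply(r, bins, sd)/sqrt(table(bins))   # +/- 2 SE bounds per bin
plot(mean.f, mean.r, las=1, xlab="Average fitted value",
     ylab="Average residual", ylim=range(-se2, se2, mean.r))
lines(mean.f,  se2, col="gray")
lines(mean.f, -se2, col="gray")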

So the remedy for your immediate problem is to fit a mixed effects logistic regression by typing:

model <- glmer(error ~ is_frisian*condition*person + (1|subject_id),
               data=output, family="binomial")
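For a self-contained illustration of the whole workflow, here is a sketch with simulated data (hypothetical subjects and a single predictor, not your experiment):

library(lme4)
library(arm)

set.seed(123)
n.subj <- 50; n.obs <- 20
subject_id <- factor(rep(1:n.subj, each=n.obs))
u  <- rnorm(n.subj, sd=1)                       # random intercept per subject
x  <- rnorm(n.subj*n.obs)                       # one predictor for simplicity
pr <- plogis(1 + 2*x + u[as.integer(subject_id)])
error <- rbinom(n.subj*n.obs, 1, pr)
sim <- data.frame(error, x, subject_id)

fit <- glmer(error ~ x + (1 | subject_id), data=sim, family="binomial")
binnedplot(fitted(fit), resid(fit, type="response"))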

For a good introduction to mixed effects logistic regression in R, see here. For a good overview of diagnostics in linear and generalized linear models, see here.
