Regression – What Does it Mean When There is a Pattern in Residuals Related to the Dependent Variable

diagnostic, linear model, regression

I created a linear regression model and realized I first need to check some assumptions.

  1. No autocorrelation -> satisfied, since it is a between-subjects experiment.
  2. Low collinearity between predictors -> checked and satisfied.
  3. Normality of residuals -> checked and satisfied.

Now I am stuck on the residual plots.

As far as I understand, I need them to check the linearity and additivity assumptions.
The predictive strength of my model is relatively low (presumably also because there are only a few data sets, with high intra-individual differences).
But I do not completely understand what the plots mean:

Here is the residuals vs. fitted diagram:

[plot: residuals vs. fitted values]

And for the two predictors:

[plots: residuals vs. each of the two predictors]

They seem just fine to me.
But there is another diagram that prints residuals in relation to the dependent variable:

[plot: residuals vs. the dependent variable]

There clearly is a pattern, which looks like a linear increase in the error as the dependent variable gets higher. What does that mean? Is it a good thing? Does it mean the linearity assumption isn't met?

Also, is there anything else I need to check to confirm the assumptions are met?

Best Answer

It usually means nothing -- and that's why we don't ordinarily look at this plot.

A regression model fits values $\hat y$ to responses $y.$ We can decompose the response into the sum of the fitted values and the residuals, $y = \hat y + (y - \hat y).$

When an ordinary least squares (OLS) model includes an intercept, the residuals are uncorrelated with the fitted values. (This statement is equivalent to the "Normal Equations" used to find the OLS solution.) Thus, in the standard diagnostic plot of the values $(\hat y, y - \hat y),$ we will see a pattern of zero correlation, exactly as in your first figure. Your question is about the closely related scatterplot of $(y, y - \hat y)$ where each residual is paired with its response rather than the fitted value.
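This zero correlation is easy to verify numerically. Here is a minimal NumPy sketch on simulated data (the variables and coefficients are illustrative): fitting OLS with an intercept column and checking that the residuals are uncorrelated with the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=2.0, size=n)

# Design matrix including an intercept column of ones
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# With an intercept, OLS residuals are orthogonal to the fitted values,
# so their correlation is zero up to floating-point error
print(np.corrcoef(fitted, resid)[0, 1])
```

The orthogonality holds by construction (the Normal Equations), so the printed correlation is at machine-precision level regardless of how noisy the simulated data are.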

An easy way to understand this situation is to recognize that the correlation remains the same when you switch coordinates.

Let's compare the switched-coordinate plots to each other. The usual one, when switched, shows (residual, fitted) = $(y-\hat y, \hat y)$ pairs. The lack of correlation means when you regress the fitted value against the residual, you get a zero slope: on average, $\hat y \approx \alpha_0 + 0(y - \hat y).$

The plot we are wondering about, with coordinates switched, shows (residual, response) pairs of

$$(y-\hat y, y) = (y-\hat y, \hat y + (y-\hat y)) \approx(y-\hat y, \alpha_0 + (y-\hat y)).$$

But points of the form $(x, \alpha_0 + x)$ (with $x = y-\hat y$) obviously lie on a line of slope $1.$ That's precisely what you see in the last plot: the regression of the response (horizontal axis) against the residual (vertical axis) must have a unit slope, not a zero slope. That's all that is going on here: your "linear increase in error" corresponds to this purely mathematical result rather than revealing anything about the data or the regression.
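The unit slope can also be checked directly. Continuing with simulated data (again, names and coefficients are illustrative), regressing the response on the residual gives a slope of exactly $1$, because $\operatorname{cov}(y, y-\hat y) = \operatorname{var}(y - \hat y)$ when the residuals are uncorrelated with the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)

# OLS fit with an intercept
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Slope of regressing y on the residuals:
# cov(y, e) / var(e) = var(e) / var(e) = 1
slope = np.cov(y, resid)[0, 1] / np.var(resid, ddof=1)
print(slope)
```

This is the "pattern" seen in the residuals-vs.-response plot: it is a mathematical identity of OLS, not a property of the data.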
