I am interested to know why residual plots are drawn with the residuals plotted against the predicted values of y rather than against y itself.
Solved – Why are residual plots constructed using the residuals vs the predicted values
data visualization, regression, residuals
Related Solutions
Judging from your diagnostic plots, I wouldn't worry about heteroscedasticity and would simply use this model.
However, you can test whether modelling the variance of the residuals improves the model:
library(nlme)
fit1 <- gls(value ~ wd, data=df)
fit2 <- gls(value ~ wd, data=df, weights = varIdent(form = ~ 1 | wd))
anova(fit1, fit2)
#     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
#fit1     1  8 5332.321 5364.509 -2658.160
#fit2     2 14 5286.911 5343.239 -2629.456 1 vs 2 57.41009  <.0001
As you can see, the model that includes the variance parameters seems slightly better. However, the standard errors and t-values differ only slightly from those of your original model, and you wouldn't draw different conclusions:
summary(fit2)
#Generalized least squares fit by REML
# Model: value ~ wd
# Data: df
# AIC BIC logLik
# 5286.911 5343.239 -2629.455
#
#Variance function:
# Structure: Different standard deviations per stratum
# Formula: ~1 | wd
# Parameter estimates:
#Wednesday Thursday Friday Saturday Sunday Monday Tuesday
#1.0000000 1.2789517 1.2876130 0.9901914 0.5975224 1.1128592 1.5904740
#
#Coefficients:
#               Value Std.Error    t-value p-value
#(Intercept) 4242.602  20.92298  202.77237       0
#wdMonday     148.931  27.65462    5.38538       0
#wdSaturday -2661.925  26.39433 -100.85215       0
#wdSunday   -3113.783  23.06607 -134.99409       0
#wdThursday   299.647  29.49021   10.16091       0
#wdTuesday    189.044  33.25205    5.68520       0
#wdWednesday  412.524  26.49180   15.57175       0
anova(fit) # 'fit' is the original lm model from the question, for comparison
#Analysis of Variance Table
#
#Response: value
# Df Sum Sq Mean Sq F value Pr(>F)
#wd 6 834554455 139092409 6538.6 < 2.2e-16 ***
#Residuals 413 8785585 21273
anova(fit2)
#Denom. DF: 413
# numDF F-value p-value
#(Intercept) 1 210610.5 <.0001
#wd 6 11982.3 <.0001
You could model the weekdays as an ordered factor, which would cause the regression functions to use polynomial contrasts.
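A minimal sketch of that approach (the ordering of the weekday levels below is an assumption; adjust it to whatever ordering makes sense for your data):
library(nlme)

# assumed ordering of the levels
wd_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
df$wd <- factor(df$wd, levels = wd_levels, ordered = TRUE)

# with an ordered factor, the model matrix uses polynomial contrasts
# (wd.L, wd.Q, wd.C, ...) instead of treatment contrasts
fit_ord <- gls(value ~ wd, data = df, weights = varIdent(form = ~ 1 | wd))
summary(fit_ord)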
By construction, the error term in an OLS model is uncorrelated with the observed values of the $X$ covariates. This will always be true for the observed data, even if the model is yielding biased estimates that do not reflect the true values of a parameter because an assumption of the model is violated (such as an omitted-variable problem or a problem with reverse causality). The predicted values are entirely a function of these covariates, so they too are uncorrelated with the error term. Thus, when you plot residuals against predicted values, they should always look random, because they are indeed uncorrelated by construction of the estimator.
In contrast, it is entirely possible (and indeed probable) for a model's error term to be correlated with $Y$ in practice. For example, with a dichotomous $X$ variable, the further the true $Y$ is from either $E(Y \mid X = 1)$ or $E(Y \mid X = 0)$, the larger the residual will be. Here is the same intuition with simulated data in R, where we know the model is unbiased because we control the data-generating process:
rm(list=ls())
set.seed(21391209)
trueSd <- 10
trueA <- 5
trueB <- as.matrix(c(3,5,-1,0))
sampleSize <- 100
# create independent x-values
x1 <- rnorm(n=sampleSize, mean = 0, sd = 4)
x2 <- rnorm(n=sampleSize, mean = 5, sd = 10)
x3 <- 3 + x1 * 4 + x2 * 2 + rnorm(n=sampleSize, mean = 0, sd = 10)
x4 <- -50 + x1 * 7 + x2 * .5 + x3 * 2 + rnorm(n=sampleSize, mean = 0, sd = 20)
X <- as.matrix(cbind(x1, x2, x3, x4))
# create dependent values according to a + bx + N(0, sd)
Y <- trueA + X %*% trueB + rnorm(n = sampleSize, mean = 0, sd = trueSd)
df <- as.data.frame(cbind(Y, X))
colnames(df) <- c("y", "x1", "x2", "x3", "x4")
ols <- lm(y ~ x1 + x2 + x3 + x4, data = df)
y_hat <- predict(ols, df)
error <- Y - y_hat
cor(y_hat, error) # Zero
cor(Y, error)     # Not zero
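As a quick check, the nonzero correlation above equals $\sqrt{1 - R^2}$ for an OLS fit with an intercept:
# for OLS with an intercept, cor(Y, residuals) equals sqrt(1 - R^2)
all.equal(as.numeric(cor(Y, error)), sqrt(1 - summary(ols)$r.squared)) # TRUE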
We get the same result (zero correlation between the residuals and the fitted values) even with a biased model, for example if we omit x1:
ols2 <- lm(y ~ x2 + x3 + x4, data = df)
y_hat2 <- predict(ols2, df)
error2 <- Y - y_hat2
cor(y_hat2, error2) # Still zero
cor(Y, error2)      # Not zero
Best Answer
The standard (OLS) linear regression model is:
$$ Y = \beta_0 + \beta_1 X + \varepsilon \\ \text{where }\ \varepsilon \sim \mathcal N(0, \sigma^2) $$
The important thing to recognize here is that the error term is normally distributed with a variance that does not depend on $X$. Since $\hat Y = \hat\beta_0 + \hat\beta_1 X$, the residuals¹ of our model can be used as estimates of the errors of the data-generating process, and we can inspect the plot of the residuals vs. the fitted values to assess the assumption of constant variance (homoscedasticity). To understand this more fully, it may help to read my answer here: What does having “constant variance” in a linear regression model mean?
On the other hand, it is not clear what a plot of the residuals vs. the raw $Y$ values would illustrate. In fact, we generally expect some degree of correlation between the residuals and $Y$. (It may help to read this excellent CV thread: What is the expected correlation between residual and the dependent variable?)
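To make that expected correlation concrete, here is a short sketch of the standard identity (assuming an OLS fit with an intercept, so the residuals $e$ have mean zero and are orthogonal to the fitted values $\hat Y$):
$$ \operatorname{cor}(e, Y) = \frac{\operatorname{cov}(e,\, \hat Y + e)}{\sqrt{\operatorname{var}(Y)\,\operatorname{var}(e)}} = \sqrt{\frac{\operatorname{var}(e)}{\operatorname{var}(Y)}} = \sqrt{1 - R^2}, $$
so the residuals are uncorrelated with $\hat Y$ by construction, but correlated with $Y$ whenever $R^2 < 1$.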
In addition, the plot of residuals vs. fitted values can be used to help identify a misspecified functional form². Again, since we expect the residuals and $Y$ to be correlated, the plot of the residuals vs. the raw $Y$ values will be misleading on this issue.
Using the code and data from my answer linked above, consider these four plots:
Neither model is misspecified, but the model represented in the right two plots does have heteroscedasticity. The top plots help you identify the clear heteroscedasticity on the right, without leading you to worry about a possible misspecified functional form. The plots on the bottom incorrectly connote misspecification, and do so more strongly than they inform us of the status of the constant variance assumption.
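The original code and data for those plots are in the answer linked above; a rough, self-contained sketch that reproduces the same kind of four-panel comparison (the simulated data and models below are illustrative assumptions, not the originals) looks like this:
set.seed(1)
n <- 200
x <- runif(n, 0, 10)

# left column: constant error variance; right column: error SD grows with x
y_hom <- 5 + 2 * x + rnorm(n, sd = 3)
y_het <- 5 + 2 * x + rnorm(n, sd = 0.8 * x)

m_hom <- lm(y_hom ~ x)
m_het <- lm(y_het ~ x)

par(mfrow = c(2, 2))
plot(fitted(m_hom), resid(m_hom), main = "Residuals vs fitted (constant variance)")
plot(fitted(m_het), resid(m_het), main = "Residuals vs fitted (heteroscedastic)")
plot(y_hom, resid(m_hom), main = "Residuals vs Y (constant variance)")
plot(y_het, resid(m_het), main = "Residuals vs Y (heteroscedastic)")
Only the top-right panel shows the fan shape that actually signals heteroscedasticity; both bottom panels show a strong upward trend simply because the residuals are correlated with $Y$.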
1. We actually use the standardized residuals here.
2. This becomes more difficult with increasing numbers of $X$ variables, though.