I am interested to know why residual plots are drawn with the residuals plotted against the predicted values of y rather than against y itself.
Solved – Why are residual plots constructed using the residuals vs the predicted values
data visualization, regression, residuals
Related Solutions
Judging from your diagnostic plots, I wouldn't worry about heteroscedasticity and would simply use this model.
However, you can test whether modelling the variance of the residuals improves the model:
library(nlme)
fit1 <- gls(value ~ wd, data=df)
fit2 <- gls(value ~ wd, data=df, weights = varIdent(form = ~ 1 | wd))
anova(fit1, fit2)
#     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
#fit1     1  8 5332.321 5364.509 -2658.160
#fit2     2 14 5286.911 5343.239 -2629.456 1 vs 2 57.41009  <.0001
As you can see, the model that includes the variance parameters seems slightly better. However, the standard errors and t-values differ only slightly from those of your original model, and you wouldn't draw different conclusions:
summary(fit2)
#Generalized least squares fit by REML
# Model: value ~ wd
# Data: df
# AIC BIC logLik
# 5286.911 5343.239 -2629.455
#
#Variance function:
# Structure: Different standard deviations per stratum
# Formula: ~1 | wd
# Parameter estimates:
#Wednesday Thursday Friday Saturday Sunday Monday Tuesday
#1.0000000 1.2789517 1.2876130 0.9901914 0.5975224 1.1128592 1.5904740
#
#Coefficients:
#               Value Std.Error    t-value p-value
#(Intercept) 4242.602  20.92298  202.77237       0
#wdMonday     148.931  27.65462    5.38538       0
#wdSaturday -2661.925  26.39433 -100.85215       0
#wdSunday   -3113.783  23.06607 -134.99409       0
#wdThursday   299.647  29.49021   10.16091       0
#wdTuesday    189.044  33.25205    5.68520       0
#wdWednesday  412.524  26.49180   15.57175       0
anova(fit) # 'fit' is the original lm model from the question, for comparison
#Analysis of Variance Table
#
#Response: value
# Df Sum Sq Mean Sq F value Pr(>F)
#wd 6 834554455 139092409 6538.6 < 2.2e-16 ***
#Residuals 413 8785585 21273
anova(fit2)
#Denom. DF: 413
# numDF F-value p-value
#(Intercept) 1 210610.5 <.0001
#wd 6 11982.3 <.0001
You could model the weekdays as an ordered factor, which would cause the regression functions to use polynomial contrasts.
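A minimal sketch of that approach (the ordering of the weekday levels below is an assumption; adjust it to whatever ordering makes sense for your data):
library(nlme)

# assumed ordering of the levels
wd_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
df$wd <- factor(df$wd, levels = wd_levels, ordered = TRUE)

# with an ordered factor, the model matrix uses polynomial contrasts
# (wd.L, wd.Q, wd.C, ...) instead of treatment contrasts
fit_ord <- gls(value ~ wd, data = df, weights = varIdent(form = ~ 1 | wd))
summary(fit_ord)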
By construction, the error term in an OLS model is uncorrelated with the observed values of the $X$ covariates. This will always be true for the observed data, even if the model is yielding biased estimates that do not reflect the true values of a parameter because an assumption of the model is violated (such as an omitted-variable problem or a problem with reverse causality). The predicted values are entirely a function of these covariates, so they too are uncorrelated with the error term. Thus, when you plot residuals against predicted values, they should always look random, because they are indeed uncorrelated by construction of the estimator.
In contrast, it is entirely possible (and indeed probable) for a model's error term to be correlated with $Y$ in practice. For example, with a dichotomous $X$ variable, the further the true $Y$ is from either $E(Y \mid X = 1)$ or $E(Y \mid X = 0)$, the larger the residual will be. Here is the same intuition with simulated data in R, where we know the model is unbiased because we control the data-generating process:
rm(list=ls())
set.seed(21391209)
trueSd <- 10
trueA <- 5
trueB <- as.matrix(c(3,5,-1,0))
sampleSize <- 100
# create independent x-values
x1 <- rnorm(n=sampleSize, mean = 0, sd = 4)
x2 <- rnorm(n=sampleSize, mean = 5, sd = 10)
x3 <- 3 + x1 * 4 + x2 * 2 + rnorm(n=sampleSize, mean = 0, sd = 10)
x4 <- -50 + x1 * 7 + x2 * .5 + x3 * 2 + rnorm(n=sampleSize, mean = 0, sd = 20)
X <- as.matrix(cbind(x1, x2, x3, x4))
# create dependent values according to a + bx + N(0, sd)
Y <- trueA + X %*% trueB + rnorm(n = sampleSize, mean = 0, sd = trueSd)
df <- as.data.frame(cbind(Y, X))
colnames(df) <- c("y", "x1", "x2", "x3", "x4")
ols <- lm(y ~ x1 + x2 + x3 + x4, data = df)
y_hat <- predict(ols, df)
error <- Y - y_hat
cor(y_hat, error) # Zero
cor(Y, error)     # Not zero
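As a quick check, the nonzero correlation above equals $\sqrt{1 - R^2}$ for an OLS fit with an intercept:
# for OLS with an intercept, cor(Y, residuals) equals sqrt(1 - R^2)
all.equal(as.numeric(cor(Y, error)), sqrt(1 - summary(ols)$r.squared)) # TRUE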
We get the same result (zero correlation between the residuals and the fitted values) even with a biased model, for example if we omit x1:
ols2 <- lm(y ~ x2 + x3 + x4, data = df)
y_hat2 <- predict(ols2, df)
error2 <- Y - y_hat2
cor(y_hat2, error2) # Still zero
cor(Y, error2)      # Not zero
Best Answer
The standard (OLS) linear regression model is:
$$ Y = \beta_0 + \beta_1 X + \varepsilon \\ \text{where }\ \varepsilon \sim \mathcal N(0, \sigma^2) $$
The important thing to recognize here is that the error term is normally distributed with a variance that does not depend on $X$. Since $\hat Y = \hat\beta_0 + \hat\beta_1 X$, the residuals¹ of our model can be used as estimates of the errors of the data-generating process, and we can inspect the plot of the residuals vs. the fitted values to assess the assumption of constant variance (homoscedasticity). To understand this more fully, it may help to read my answer here: What does having “constant variance” in a linear regression model mean?
On the other hand, it is not clear what a plot of the residuals vs. the raw $Y$ values would illustrate. In fact, we generally expect some degree of correlation between the residuals and $Y$. (It may help to read this excellent CV thread: What is the expected correlation between residual and the dependent variable?)
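To make that expected correlation concrete, here is a short sketch of the standard identity (assuming an OLS fit with an intercept, so the residuals $e$ have mean zero and are orthogonal to the fitted values $\hat Y$):
$$ \operatorname{cor}(e, Y) = \frac{\operatorname{cov}(e,\, \hat Y + e)}{\sqrt{\operatorname{var}(Y)\,\operatorname{var}(e)}} = \sqrt{\frac{\operatorname{var}(e)}{\operatorname{var}(Y)}} = \sqrt{1 - R^2}, $$
so the residuals are uncorrelated with $\hat Y$ by construction, but correlated with $Y$ whenever $R^2 < 1$.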
In addition, the plot of residuals vs. fitted values can be used to help identify a misspecified functional form². Again, since we expect the residuals and $Y$ to be correlated, the plot of the residuals vs. the raw $Y$ values will be misleading on this issue.
Using the code and data from my answer linked above, consider these four plots:
Neither model is misspecified, but the model represented in the right two plots does have heteroscedasticity. The top plots help you identify the clear heteroscedasticity on the right, without leading you to worry about a possible misspecified functional form. The plots on the bottom incorrectly connote misspecification, and do so more strongly than they inform us of the status of the constant variance assumption.
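The original code and data for those plots are in the answer linked above; a rough, self-contained sketch that reproduces the same kind of four-panel comparison (the simulated data and models below are illustrative assumptions, not the originals) looks like this:
set.seed(1)
n <- 200
x <- runif(n, 0, 10)

# left column: constant error variance; right column: error SD grows with x
y_hom <- 5 + 2 * x + rnorm(n, sd = 3)
y_het <- 5 + 2 * x + rnorm(n, sd = 0.8 * x)

m_hom <- lm(y_hom ~ x)
m_het <- lm(y_het ~ x)

par(mfrow = c(2, 2))
plot(fitted(m_hom), resid(m_hom), main = "Residuals vs fitted (constant variance)")
plot(fitted(m_het), resid(m_het), main = "Residuals vs fitted (heteroscedastic)")
plot(y_hom, resid(m_hom), main = "Residuals vs Y (constant variance)")
plot(y_het, resid(m_het), main = "Residuals vs Y (heteroscedastic)")
Only the top-right panel shows the fan shape that actually signals heteroscedasticity; both bottom panels show a strong upward trend simply because the residuals are correlated with $Y$.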
1. We actually use the standardized residuals here.
2. This becomes more difficult with increasing numbers of $X$ variables, though.