Solved – Heteroskedasticity and normality of residuals

heteroscedasticity, normal distribution, regression, residuals

I have a linear regression that's quite good, I guess (it's for a university project so I don't really have to be super accurate).

Point is, if I plot the residuals vs. the predicted values, there is (according to my teacher) a hint of heteroskedasticity.

But if I look at the Q-Q plot of the residuals, they clearly appear to be normally distributed.
Moreover, the Shapiro-Wilk test on the residuals has a $p$-value of $0.8$, so I think there's no doubt the residuals really are normally distributed.
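
(In R, these checks amount to something like the following, with fit standing for my fitted model:)

qqnorm(residuals(fit)); qqline(residuals(fit)) # Q-Q plot of the residuals
shapiro.test(residuals(fit))                   # Shapiro-Wilk test gave p = 0.8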

Question: How can the residuals be heteroskedastic with respect to the predicted values if they are normally distributed?

Best Answer

One way to approach this question is to look at it in reverse: how could we begin with normally distributed residuals and arrange them to be heteroscedastic? From this point of view the answer becomes obvious: associate the smaller residuals with the smaller predicted values.

To illustrate, here is an explicit construction.

[Figure: simulated data with the linear fit (left), histogram of the residuals (middle), and residuals vs. predicted values (right)]

The data at the left are clearly heteroscedastic relative to the linear fit (shown in red). This is driven home by the residuals vs. predicted plot at the right. But, by construction, the unordered set of residuals is close to normally distributed, as their histogram in the middle shows. (The $p$-value of the Shapiro-Wilk test of normality is $0.60$, obtained with the R command shapiro.test(residuals(fit)) issued after running the code below.)

Real data can look like this, too. The moral is that heteroscedasticity characterizes a relationship between residual size and predictions whereas normality tells us nothing about how the residuals relate to anything else.
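
A formal check points the same way: the Breusch-Pagan test asks whether the squared residuals vary systematically with the regressors, which is exactly the pattern visible in the right-hand panel. A minimal sketch, assuming the lmtest package is installed, to be run after the code below:

library(lmtest)
bptest(fit) # Breusch-Pagan test; a small p-value indicates heteroscedasticity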


Here is the R code for this construction.

set.seed(17)
n <- 256
x <- (1:n)/n                       # The set of x values
e <- rnorm(n, sd=1)                # A set of *normally distributed* values
i <- order(runif(n, max=dnorm(e))) # Ranks that tend to put small |e| last ...
y <- 1 + 5 * x + e[rev(i)]         # ... so rev(i) pairs small errors with small x
fit <- lm(y ~ x)                   # Regress `y` against `x`.
par(mfrow=c(1,3))                  # Set up the plots ...
plot(x,y, main="Data", cex=0.8)
abline(coef(fit), col="Red")
hist(residuals(fit), main="Residuals")
plot(predict(fit), residuals(fit), cex=0.8, main="Residuals vs. Predicted")
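
The normality test quoted above can be reproduced by following this with:

shapiro.test(residuals(fit))       # Shapiro-Wilk test of the residuals: p = 0.60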