Solved – Regression: why test normality of overall residuals, instead of residuals conditional on $\hat{y}$

assumptions, regression

I understand that in linear regression the errors are assumed to be normally distributed, conditional on the predicted value of $y$. Then we look at the residuals as a kind of proxy for the errors.

It's often recommended to generate output like this: [figure not preserved]. However, I don't understand the point of obtaining the residual for each data point and mashing them all together in a single plot.

I understand that we are unlikely to have sufficient data points to properly assess whether we have normal residuals at each predicted value of $y$.

However, isn't the question of whether we have normal residuals overall a separate one, and one that doesn't clearly relate to the model assumption of normal residuals at each predicted value of $y$? Couldn't we have normal residuals at each predicted value of $y$, while having overall residuals that were quite non-normal?

Best Answer

> Couldn't we have normal residuals at each predicted value of $y$, while having overall residuals that were quite non-normal?

No -- at least, not under the standard assumption that the variance of the errors is constant.

You can think of the distribution of overall residuals as a mixture of normal distributions (one for each level of $\hat{y}$). By assumption, all of these normal distributions have the same mean (0) and the same variance. Thus, the distribution of this mixture of normals is itself simply a normal distribution.
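The mixture argument can be written out in one line. As a sketch, suppose the levels of $\hat{y}$ are indexed by $k$, with mixing weights $w_k$ (the share of observations at each level) and $\phi(e; 0, \sigma^2)$ the normal density. This notation is introduced here for illustration and is not in the original answer.

```latex
f(e) \;=\; \sum_k w_k \, \phi(e; 0, \sigma^2)
     \;=\; \phi(e; 0, \sigma^2) \sum_k w_k
     \;=\; \phi(e; 0, \sigma^2),
\qquad \text{since } \sum_k w_k = 1 .

% If the variances differ across levels, the mixture is no longer normal:
\text{kurtosis} \;=\; \frac{3 \sum_k w_k \sigma_k^4}{\left( \sum_k w_k \sigma_k^2 \right)^2} \;\ge\; 3,
\qquad \text{with equality only when all } \sigma_k \text{ are equal.}
```

The second line is why the equal-variance condition matters: normal components with unequal variances pool into a heavier-tailed distribution.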

From this we can form a little syllogism based on modus tollens: if P then Q; not Q; therefore not P. Here, P is "the distributions of the errors given the values of the predictor X are normal with equal variances" and Q is "the distribution of the overall residuals is normal". So if we observe that the distribution of overall residuals is apparently not normal, this implies that the distributions given X are not all normal with equal variance, which is a violation of the standard assumptions.
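A small simulation may make the "if P then Q" direction concrete. This is only a sketch: the number of levels, the observations per level, and the standard deviations are hypothetical values chosen for illustration. It pools errors that are normal within each level, once with equal variances and once with variances that grow across levels, and compares the excess kurtosis of the pooled values.

```r
set.seed(1)
n_levels <- 50    # hypothetical number of distinct fitted values
n_per    <- 200   # hypothetical number of observations per level

# Excess kurtosis (0 for a normal distribution), simple moment estimator
excess_kurtosis <- function(z) {
  z <- z - mean(z)
  mean(z^4) / mean(z^2)^2 - 3
}

# Case 1: every level has the same error standard deviation
sd_equal <- rep(1, n_levels)
e_equal  <- rnorm(n_levels * n_per, mean = 0, sd = rep(sd_equal, each = n_per))

# Case 2: normal within each level, but the sd grows across levels
sd_unequal <- seq(0.2, 3, length.out = n_levels)
e_unequal  <- rnorm(n_levels * n_per, mean = 0, sd = rep(sd_unequal, each = n_per))

k_equal   <- excess_kurtosis(e_equal)
k_unequal <- excess_kurtosis(e_unequal)
print(c(equal = k_equal, unequal = k_unequal))
```

In the equal-variance case the pooled values should look normal (excess kurtosis near 0); in the unequal-variance case the pooled distribution is heavy-tailed even though each level is exactly normal.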

@BigBendRegion points out something in the comments that I think is worth adding to this answer for emphasis. The line of argument I outlined above works for refuting normality, but it cannot be used to confirm normality. That is, if we check the marginal distribution of residuals and see that it does appear normal, this does NOT entail that the residuals conditional on X are normal (see HERE for counterexamples). In terms of the P and Q statements above, observing that Q is true does not entail that P is true. That would be affirming the consequent.

Related Solutions

Solved – What to do when Kolmogorov-Smirnov test is significant for residuals of parametric test but skewness and kurtosis look normal

> In order to make sure that I can use a parametric test, I need to make sure that my residual distribution is normal.

There is really no way to demonstrate that you have exact normality, but that's okay because approximate normality will generally be sufficient for hypothesis tests in regression to work the way you want.

> However, when I look at the skewness and kurtosis of the residuals, they are -0.017 and -0.438 respectively, which I think is considered normal.

You can obtain values like that with residuals from a simple regression on normal data, but the kurtosis is just significant at the 5% level.

(Technical aside: I used simulation to assess the significance of the kurtosis of residuals here; not knowing the number of predictors, I did it for both independent normals and for one predictor at the given sample size, both showed essentially the same p-value; results should be similar for regression with small numbers of predictors.)
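The simulation described in the aside could look roughly like the following. This is a sketch, not the answerer's actual code: the sample size `n` is not given in the source and is set to a hypothetical value here, the regression coefficients are arbitrary, and the kurtosis estimator is the simple moment version, which may differ from the one the asker's software reports.

```r
set.seed(1)
n        <- 100      # ASSUMPTION: the sample size is not given in the source
observed <- -0.438   # excess kurtosis reported in the question
n_sim    <- 2000     # number of simulated data sets

# Excess kurtosis, simple moment estimator (an assumption about the definition)
excess_kurtosis <- function(z) {
  z <- z - mean(z)
  mean(z^4) / mean(z^2)^2 - 3
}

# Null distribution 1: independent normal draws
k_iid <- replicate(n_sim, excess_kurtosis(rnorm(n)))

# Null distribution 2: residuals from a one-predictor regression with normal errors
x <- seq_len(n) / n
k_res <- replicate(n_sim, {
  y <- 1 + 2 * x + rnorm(n)   # arbitrary illustrative coefficients
  excess_kurtosis(residuals(lm(y ~ x)))
})

# Two-sided Monte Carlo p-value: twice the smaller tail proportion, capped at 1
p_two_sided <- function(k_sim, k_obs) {
  min(1, 2 * min(mean(k_sim <= k_obs), mean(k_sim >= k_obs)))
}

p_iid <- p_two_sided(k_iid, observed)
p_res <- p_two_sided(k_res, observed)
print(c(iid = p_iid, regression = p_res))
```

The resulting p-values depend strongly on `n`, so they should not be read as reproducing the figure quoted above; the point is only that the two null distributions give similar answers.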

This doesn't actually suggest a problem with the inference when doing a regression or correlation, however. Your data won't be exactly normal; the essential question is 'are the data so badly non-normal that the inference no longer has the properties you wish?'

> Unfortunately, when I run the Kolmogorov-Smirnov test, the significance value is 0.021, which indicates that the residuals are not normal.

What were the specified population mean and variance of the residuals for your KS test and how did you get such population values?
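The reason this question matters is that the standard Kolmogorov-Smirnov p-value assumes the mean and variance were fixed in advance; if they are estimated from the same residuals, that p-value is no longer valid (this is the situation the Lilliefors variant addresses). A minimal sketch, using hypothetical simulated data in place of the asker's residuals, compares the naive p-value with one obtained by simulating the null distribution with re-estimated parameters:

```r
set.seed(1)
n <- 100        # ASSUMPTION: hypothetical sample size
r <- rnorm(n)   # hypothetical stand-in for the regression residuals

# KS statistic against a normal whose mean and sd are estimated from the same data
ks_stat <- function(z) {
  unname(ks.test(z, "pnorm", mean = mean(z), sd = sd(z))$statistic)
}

d_obs <- ks_stat(r)

# Naive p-value: treats the estimated parameters as if they were known in advance
p_naive <- ks.test(r, "pnorm", mean = mean(r), sd = sd(r))$p.value

# Simulated null distribution of the statistic when parameters are re-estimated
d_null <- replicate(2000, ks_stat(rnorm(n)))
p_sim  <- mean(d_null >= d_obs)

print(c(naive = p_naive, simulated = p_sim))
```

The naive p-value will typically be noticeably larger than the simulated one, because estimating the parameters pulls the fitted normal toward the data.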

> Could anybody please explain to me what to do?

I suggest you don't use a hypothesis test to assess the suitability of the normality assumption, but instead look at diagnostic displays that show you how badly non-normal the data are.
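As one possible example of such diagnostic displays, the sketch below draws a normal Q-Q plot and a histogram of standardized residuals. The data are simulated and purely hypothetical; with real data you would replace `fit` with your own fitted model.

```r
set.seed(1)
n <- 100                      # hypothetical sample size
x <- runif(n)                 # hypothetical predictor
y <- 1 + 2 * x + rnorm(n)     # hypothetical response with normal errors
fit <- lm(y ~ x)
r <- rstandard(fit)           # standardized residuals

par(mfrow = c(1, 2))
# Points close to the reference line suggest approximate normality
qqnorm(r, main = "Normal Q-Q plot of residuals")
qqline(r, col = "red")
# The histogram gives a rough view of skewness and tail weight
hist(r, breaks = 20, main = "Histogram of residuals", xlab = "Standardized residual")
```

What to look for is the size and pattern of any departure from the line (for example, strong curvature or a few extreme points) rather than a yes/no verdict.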

Solved – Heteroskedasticity and residuals normality

One way to approach this question is to look at it in reverse: how could we begin with normally distributed residuals and arrange them to be heteroscedastic? From this point of view the answer becomes obvious: associate the smaller residuals with the smaller predicted values.

To illustrate, here is an explicit construction.

The data at the left are clearly heteroscedastic relative to the linear fit (shown in red). This is driven home by the residuals vs predicted plot at the right. But--by construction--the unordered set of residuals is close to normally distributed, as their histogram in the middle shows. (The p-value in the Shapiro-Wilk test of normality is 0.60, obtained with the R command shapiro.test(residuals(fit)) issued after running the code below.)

Real data can look like this, too. The moral is that heteroscedasticity characterizes a relationship between residual size and predictions whereas normality tells us nothing about how the residuals relate to anything else.

Here is the R code for this construction.

set.seed(17)
n <- 256
x <- (1:n)/n                       # The set of x values
e <- rnorm(n, sd=1)                # A set of *normally distributed* values
i <- order(runif(n, max=dnorm(e))) # Put the larger ones towards the end on average
y <- 1 + 5 * x + e[rev(i)]         # Generate some y values plus "error" `e`.
fit <- lm(y ~ x)                   # Regress `y` against `x`.
par(mfrow=c(1,3))                  # Set up the plots ...
plot(x,y, main="Data", cex=0.8)
abline(coef(fit), col="Red")
hist(residuals(fit), main="Residuals")
plot(predict(fit), residuals(fit), cex=0.8, main="Residuals vs. Predicted")
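To check the two claims numerically rather than by eye, one could compare the residual spread in the lower and upper halves of the fitted values and run the Shapiro-Wilk test on the pooled residuals. The sketch below repeats the construction so that it runs on its own; the median split and the names `lower`, `sd_lower`, `sd_upper` and `p_normal` are illustrative choices, not part of the original answer.

```r
# Same construction as above
set.seed(17)
n <- 256
x <- (1:n)/n
e <- rnorm(n, sd=1)
i <- order(runif(n, max=dnorm(e)))
y <- 1 + 5 * x + e[rev(i)]
fit <- lm(y ~ x)

r <- residuals(fit)
lower <- predict(fit) <= median(predict(fit))  # TRUE for the smaller fitted values

sd_lower <- sd(r[lower])                # residual spread at small fitted values
sd_upper <- sd(r[!lower])               # residual spread at large fitted values
p_normal <- shapiro.test(r)$p.value     # normality test on the pooled residuals

print(c(sd_lower = sd_lower, sd_upper = sd_upper, shapiro_p = p_normal))
```

A clearly larger spread in the upper half alongside an unremarkable Shapiro-Wilk p-value is the combination the construction is meant to produce: heteroscedastic, yet marginally close to normal.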
