Solved – How to check for normality: raw data or residuals

assumptions, normality-assumption, residuals

I've learnt that I must test for normality not on the raw data but on the residuals. Should I calculate the residuals and then run the Shapiro–Wilk W test on them?

Are residuals calculated as $X_i - \text{mean}$?

Please see this previous question for my data and the design.

Best Answer

Why must you test for normality?

The standard assumption in linear regression is that the theoretical residuals (the error terms of the model) are independent and normally distributed. The observed residuals are estimates of the theoretical residuals, but they are not independent (there are transformations of the residuals that remove some of the dependence, but these still give only an approximation of the true residuals). So a test on the observed residuals does not guarantee that the theoretical residuals are normal.
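To make "observed residuals" concrete, here is a minimal sketch of computing them and passing them to Shapiro–Wilk. The data are simulated and every variable name is made up for illustration; this is not the data from the question. Residuals are the observed response minus the fitted value, so for a model with only an intercept they reduce to $X_i - \text{mean}$, and for a design with groups they are the observation minus its group mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: a straight line plus normal noise.
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)

# Fit y = b0 + b1 * x by ordinary least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Observed residuals: observed response minus fitted value.
fitted = X @ beta
residuals = y - fitted

# With an intercept in the model the observed residuals sum to zero,
# which is one way to see that they are not independent of each other.
print("sum of residuals:", residuals.sum())

# Shapiro-Wilk applied to the observed residuals, not to the raw y.
w_stat, p_value = stats.shapiro(residuals)
print("W =", w_stat, "p =", p_value)
```

The sum-to-zero constraint (and the matching constraint against each predictor) is the dependence mentioned above: the test is applied to estimates of the errors, not to the errors themselves.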

If the theoretical residuals are not exactly normally distributed but the sample size is large enough, then the Central Limit Theorem says that the usual inference (tests and confidence intervals, but not necessarily prediction intervals) based on the assumption of normality will still be approximately correct.
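A small simulation can illustrate this. The sketch below uses the simplest possible model, estimating a single mean, with clearly skewed (exponential) data, and checks how often the usual 95% t-interval covers the true mean. The distribution, sample size, and number of replications are assumptions chosen for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

n, reps = 200, 2000   # sample size and number of simulated data sets
true_mean = 1.0       # mean of the exponential distribution used below

# Each row is one simulated sample from a strongly skewed distribution.
samples = rng.exponential(scale=true_mean, size=(reps, n))

# Usual normal-theory 95% t-interval for the mean, one per sample.
means = samples.mean(axis=1)
std_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)

# Fraction of intervals that contain the true mean.
covered = np.abs(means - true_mean) <= t_crit * std_errors
coverage = covered.mean()
print("empirical coverage of the nominal 95% interval:", coverage)
```

If the coverage comes out close to the nominal 95%, the normal-theory interval is doing its job despite the skewness. Rerunning with a much smaller `n` is a quick way to see where the approximation starts to degrade.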

Also note that tests of normality are rule-out tests: they can tell you that the data are unlikely to have come from a normal distribution. But if the test is not significant, that does not mean that the data came from a normal distribution; it could also mean that you just don't have enough power to see the difference. Larger sample sizes give more power to detect non-normality, but with larger samples the CLT makes the non-normality less important. So for small sample sizes the assumption of normality is important but the tests have too little power to be informative; for large sample sizes the tests are more sensitive, but the question of exact normality becomes largely irrelevant.
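The power argument can be checked by simulation. The sketch below draws samples from a moderately skewed distribution (chi-square with 10 degrees of freedom, an arbitrary choice for illustration) and records how often Shapiro–Wilk rejects normality at a small and at a large sample size. The helper `rejection_rate` is a made-up name, not a library function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed so the sketch is reproducible


def rejection_rate(n, reps=500, alpha=0.05):
    """Fraction of simulated samples of size n for which Shapiro-Wilk rejects."""
    rejections = 0
    for _ in range(reps):
        # The data are never normal here: chi-square(10) is right-skewed.
        sample = rng.chisquare(df=10, size=n)
        _, p = stats.shapiro(sample)
        rejections += int(p < alpha)
    return rejections / reps


rate_small = rejection_rate(10)    # small sample: the test often misses the skew
rate_large = rejection_rate(500)   # large sample: the test detects it far more often
print("rejection rate at n=10: ", rate_small)
print("rejection rate at n=500:", rate_large)
```

The data-generating distribution is the same in both cases; only the sample size changes. A non-significant result at small `n` therefore says more about the power of the test than about the shape of the population.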

So, combining all of the above, what is more important than a test of exact normality is an understanding of the science behind the data, to judge whether the population is close enough to normal. Graphs like Q-Q plots can be good diagnostics, but an understanding of the science is needed as well. If there is concern that there is too much skewness or potential for outliers, then non-parametric methods are available that do not require the normality assumption.
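As a rough sketch of the Q-Q diagnostic, `scipy.stats.probplot` returns the points of a normal Q-Q plot along with the correlation `r` of the fitted straight line. The two samples below are simulated purely for illustration; with real data you would pass in the residuals of your model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # fixed seed so the sketch is reproducible

# Two hypothetical sets of residuals: one normal, one strongly right-skewed.
normal_sample = rng.normal(0, 1, size=200)
skewed_sample = rng.lognormal(mean=0, sigma=1, size=200)

# probplot returns (theoretical quantiles, ordered data) and the fitted line.
(osm_n, osr_n), (slope_n, intercept_n, r_normal) = stats.probplot(normal_sample, dist="norm")
(osm_s, osr_s), (slope_s, intercept_s, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# Points close to a straight line (r near 1) are consistent with normality;
# systematic curvature lowers r and shows up clearly in the plot.
print("Q-Q correlation, normal sample:", r_normal)
print("Q-Q correlation, skewed sample:", r_skewed)
```

To draw the plot itself, `probplot` also accepts `plot=plt` with `matplotlib.pyplot` imported as `plt`. The picture is usually more informative than the single number `r`, since it shows where the departure from normality occurs (tails, skewness, isolated outliers).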