Regression – Why Expect Residuals in OLS Regression to Be Normally Distributed?

least squares, normality-assumption, regression, residuals

There are a lot of similar questions here but I have not found an answer to this specific question.

Source: for example in https://peopleanalytics-regression-book.org/linear-reg-ols.html#norm-dist-assum the author (a mathematician) says:

In an appropriate model we expect our errors to be random, so we
would therefore expect our residuals to be normally distributed over
sufficient numbers of observations.

The author then goes on to apply qqnorm(newmodel$residuals) to the data for diagnostics.

If you plot a model in R (plot(mymodel)), you get a bunch of diagnostic plots, the second of which is standardized residuals plotted against the theoretical quantiles – so essentially the same.

But why? What is the reasoning behind expecting the residuals to be normally distributed, rather than following some other distribution or no recognised distribution at all? Stats textbooks treat this as if it were obvious – could someone explain, please?

Best Answer

That author is writing nonsense. Errors being random does not mean that a large number of them will be normally distributed. It is absolutely not the case that OLS requires normally distributed residuals: its objective is "Least Squares", and minimizing the sum of squared deviations from the fitted values in no way requires any particular distribution for the residuals. See, for example, "Regression when the OLS residuals are not normally distributed" and some of the associated questions linked in the comments there.
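To make the point concrete, here is a minimal simulation sketch. It is written in Python with only the standard library rather than R, and everything in it is an assumption made for illustration: the data-generating process y = 2 + 3x + e, the centered exponential errors, the sample size and the seed. It fits a one-predictor OLS line in closed form and then looks at the skewness of the residuals.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data-generating process: y = 2 + 3x + e,
# where e is a centered exponential error (mean 0, strongly right-skewed).
n = 5000
true_intercept, true_slope = 2.0, 3.0
x = [random.uniform(0, 10) for _ in range(n)]
errors = [random.expovariate(1.0) - 1.0 for _ in range(n)]
y = [true_intercept + true_slope * xi + ei for xi, ei in zip(x, errors)]

# Closed-form OLS for a single predictor: slope = Sxy / Sxx.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Sample skewness of the residuals (they average to zero because the
# model has an intercept). A normal distribution has skewness 0.
m2 = sum(r ** 2 for r in residuals) / n
m3 = sum(r ** 3 for r in residuals) / n
skewness = m3 / m2 ** 1.5

print(f"slope={slope:.3f} intercept={intercept:.3f} skewness={skewness:.2f}")
```

Under these assumptions the fitted slope and intercept should land close to the true values, while the residuals stay clearly right-skewed however many observations are drawn. More data makes the residuals resemble the error distribution more closely; it does not make them normal.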

It is true that if the underlying errors follow a Normal distribution, and are independent and identically distributed, then the OLS estimator is also the maximum likelihood estimator. But that in no way justifies the author's statement quoted in the original post.
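For readers who want the maximum likelihood connection spelled out, here is a short sketch of the standard argument. The notation is an assumption introduced here, not taken from the question: observations $y_i$, predictor vectors $x_i$, coefficient vector $\beta$, and errors $\varepsilon_i$ that are independent $N(0, \sigma^2)$.

```latex
y_i = x_i^\top \beta + \varepsilon_i, \qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \qquad i = 1, \dots, n

\ell(\beta, \sigma^2)
  = -\frac{n}{2} \log\!\left(2 \pi \sigma^2\right)
    - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2

\hat{\beta}_{\text{ML}}
  = \arg\max_{\beta} \, \ell(\beta, \sigma^2)
  = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2
  = \hat{\beta}_{\text{OLS}}
```

For any fixed $\sigma^2 > 0$, only the last term of the log-likelihood depends on $\beta$, and it enters with a negative sign, so maximizing the likelihood over $\beta$ is the same as minimizing the sum of squared residuals. The implication runs from normal errors to "OLS equals MLE"; nothing in it says that errors must be normal for least squares to be computed.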

Edit: For more, read @Glen_b's comment below.