Linear Regression – Why Care About Normally Distributed Error Terms and Homoskedasticity

assumptions, normality-assumption, regression, robust, teaching

I suppose I get frustrated every time I hear someone say that non-normality of residuals and/or heteroskedasticity violates the OLS assumptions. By the Gauss-Markov theorem, neither assumption is necessary to estimate the parameters of an OLS model. I see how they matter for hypothesis testing in the OLS model, because assuming them gives us neat formulas for t-tests, F-tests, and more general Wald statistics.
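To make the Gauss-Markov point concrete, here is a small simulation sketch (data-generating process invented for illustration): the errors below are skewed (centered exponential) and heteroskedastic, yet the OLS slope estimate is still unbiased, since unbiasedness only needs the errors to have zero conditional mean.

```python
# Sketch: OLS remains unbiased under non-normal, heteroskedastic errors.
import numpy as np

rng = np.random.default_rng(0)
n, reps, true_slope = 200, 2000, 2.0
slopes = []
for _ in range(reps):
    x = rng.uniform(1, 10, n)
    # Non-normal (centered exponential) errors whose spread grows with x
    e = x * (rng.exponential(1.0, n) - 1.0)
    y = 1.0 + true_slope * x + e
    X = np.column_stack([np.ones(n), x])
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    slopes.append(beta_hat[1])

print(np.mean(slopes))  # close to 2.0 despite the "violated" assumptions
```

What breaks is not the point estimate but the textbook standard-error formulas, which is exactly where the question is headed.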

But it is not too hard to do hypothesis testing without them. If we drop only homoskedasticity, we can easily compute robust or clustered standard errors. If we drop normality altogether, we can use bootstrapping and, given another parametric specification for the error terms, likelihood-ratio and Lagrange multiplier tests.
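Both alternatives are a few lines of code. Below is a sketch on made-up heteroskedastic, heavy-tailed data: HC1 (White) robust standard errors computed directly from the sandwich formula, and a pairs bootstrap of the slope; the two should roughly agree.

```python
# Sketch (hypothetical data): heteroskedasticity-robust (HC1) standard
# errors and a pairs bootstrap, with plain NumPy -- no normality or
# homoskedasticity assumed.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 5, n)
# Heavy-tailed t(5) errors whose scale grows with x
y = 1.0 + 2.0 * x + x * rng.standard_t(df=5, size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# HC1 sandwich: (X'X)^-1 [sum_i e_i^2 x_i x_i'] (X'X)^-1, scaled by n/(n-k)
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc1 = n / (n - X.shape[1]) * XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(cov_hc1))

# Pairs bootstrap: resample (x, y) rows with replacement, refit,
# and take the standard deviation of the slope draws
boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    boot_slopes.append(np.linalg.lstsq(X[idx], y[idx], rcond=None)[0][1])
se_boot = np.std(boot_slopes, ddof=1)

print(se_robust[1], se_boot)  # the two slope SEs should roughly agree
```

In practice one would reach for a library (e.g., statsmodels supports `fit(cov_type="HC1")` and clustered covariance types), but the manual version shows there is nothing mysterious going on.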

It's just a shame that we teach it this way, because I see a lot of people struggling with assumptions they do not have to meet in the first place.

Why is it that we stress these assumptions so heavily when we have the ability to easily apply more robust techniques? Am I missing something important?

Best Answer

In econometrics, we would say that non-normality violates the conditions of the Classical Normal Linear Regression (CNLR) model, while heteroskedasticity violates the assumptions of both the CNLR and the Classical Linear Regression (CLR) model.

But those who say "...violates OLS" are also justified: the name Ordinary Least Squares comes directly from Gauss and essentially refers to normal errors. In other words, "OLS" is not shorthand for least-squares estimation in general (which is a much broader principle and approach), but for the CNLR specifically.

OK, that was history, terminology, and semantics. I understand the core of the OP's question as follows: "Why should we emphasize the ideal, if we have found solutions for the cases where it is absent?" (The CNLR assumptions are ideal in the sense that they provide excellent least-squares estimator properties "off the shelf", without the need to resort to asymptotic results. Remember also that OLS is maximum likelihood when the errors are normal.)
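The parenthetical equivalence of OLS and maximum likelihood under normality takes two lines: with i.i.d. errors $\varepsilon_i \sim N(0,\sigma^2)$, the log-likelihood is

$$\ell(\beta,\sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - \mathbf{x}_i'\beta\right)^2,$$

and for any fixed $\sigma^2$ the only term involving $\beta$ is the sum of squared residuals, so maximizing the likelihood in $\beta$ is exactly minimizing the least-squares criterion.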

As an ideal, it is a good place to start teaching. This is what we always do in teaching any subject: "simple" situations are "ideal" situations, free of the complexities one will actually encounter in real life and real research, complexities for which no definitive solutions exist.

And this is what I find problematic about the OP's post: he writes about robust standard errors and the bootstrap as though they were superior alternatives, foolproof solutions to the absence of the assumptions under discussion, about which, moreover, the OP writes

"...assumptions they do not have to meet in the first place."

Why? Because there exist some methods of dealing with the situation, methods that have some validity, of course, but which are far from ideal? The bootstrap and heteroskedasticity-robust standard errors are not the solutions: if they were, they would have become the dominant paradigm, sending the CLR and the CNLR to the history books. But they have not.

So we start from the set of assumptions that guarantees the estimator properties we have deemed important (whether the properties designated as desirable are indeed the ones that should be is another discussion), so that we keep visible the fact that any violation of them has consequences which cannot be fully offset by the methods we have devised to deal with their absence. It would be really dangerous, scientifically speaking, to convey the feeling that "we can bootstrap our way to the truth of the matter", because, simply, we cannot.

So they remain imperfect solutions to a problem, not an alternative and certainly not a definitively superior way to do things. We therefore have to teach the problem-free situation first, then point to the possible problems, and then discuss possible solutions. Otherwise, we would elevate these solutions to a status they do not really have.