Regression – Why Expect Residuals in OLS Regression to Be Normally Distributed?

least squares, normality-assumption, regression, residuals

There are a lot of similar questions here but I have not found an answer to this specific question.

Source: for example in https://peopleanalytics-regression-book.org/linear-reg-ols.html#norm-dist-assum the author (a mathematician) says:

In an appropriate model we expect our errors to be random, so we
would therefore expect our residuals to be normally distributed over
sufficient numbers of observations.

The author then goes on to apply qqnorm(newmodel$residuals) to the data for diagnostics.

If you plot a model in R (plot(mymodel)), you get a bunch of diagnostic plots, the second of which is standardized residuals plotted against the theoretical quantiles – so essentially the same.

But why? What is the reasoning behind the residuals being normally distributed, and not just randomly, without having a recognised distribution at all, or some other distribution? Stats textbooks treat this as if it was obvious – could someone explain, please?

Best Answer

That author is writing nonsense. Just because errors are random doesn't mean that if you have a lot of them they will be normally distributed. It is absolutely not the case that OLS requires normally distributed residuals; its objective is "Least Squares", and minimizing the sum of squared deviations from the estimated values in no way requires any particular distribution for the residuals. See, for example, "Regression when the OLS residuals are not normally distributed", and some of the associated questions linked to in its comments.
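A minimal sketch of this point on simulated data. The true coefficients (1 and 2), the sample size, and the centred exponential error distribution are all illustrative assumptions, not something from the linked book or the question. Least squares recovers the line perfectly well even though the residuals are clearly skewed:

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 5000                        # illustrative sample size
x <- runif(n, 0, 10)
eps <- rexp(n, rate = 1) - 1     # mean-zero but right-skewed errors
y <- 1 + 2 * x + eps

fit <- lm(y ~ x)
coef(fit)                        # should be close to 1 and 2

res <- residuals(fit)
# Sample skewness of the residuals: about 2 for an exponential, 0 for a normal
skew <- mean((res - mean(res))^3) / sd(res)^3
skew

qqnorm(res); qqline(res)         # points bend away from the reference line
```

Random errors, lots of observations, a perfectly appropriate model, and the residuals are still nowhere near normal.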

It is true that if the underlying errors follow a Normal distribution, and are independent and identically distributed, then the OLS estimator is also the maximum likelihood estimator. But that in no way justifies the author's statement quoted in the original post.
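To spell out that equivalence (standard notation, not taken from the quoted book: $x_i$ is the predictor vector for observation $i$, $\beta$ the coefficient vector, $\sigma^2$ the error variance), suppose $y_i = x_i^\top \beta + \varepsilon_i$ with $\varepsilon_i$ independent $\mathcal{N}(0, \sigma^2)$. The log-likelihood is then

$$
\ell(\beta, \sigma^2) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 .
$$

For any fixed $\sigma^2$, maximizing $\ell$ over $\beta$ is the same as minimizing $\sum_{i}(y_i - x_i^\top \beta)^2$, which is exactly the least squares criterion. The implication runs from "errors are normal" to "OLS is the MLE", not from "errors are random" to "errors are normal".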

Edit: For more, read @Glen_b's comment below.

Related Solutions

Residuals vs Y – What if Residuals are Normally Distributed but Y is Not?

It is reasonable for the residuals in a regression problem to be normally distributed, even though the response variable is not. Consider a univariate regression problem where $y \sim \mathcal{N}(\beta x, \sigma^2)$, so that the regression model is appropriate, and further assume that the true value of $\beta=1$. In this case, while the residuals of the true regression model are normal, the distribution of $y$ depends on the distribution of $x$, as the conditional mean of $y$ is a function of $x$. If the dataset has a lot of values of $x$ that are close to zero and progressively fewer the higher the value of $x$, then the distribution of $y$ will be skewed to the right. If values of $x$ are distributed symmetrically, then $y$ will be distributed symmetrically, and so forth. For a regression problem, we only assume that the response is normal conditioned on the value of $x$.
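A rough simulation of that scenario, assuming an exponential distribution for $x$ (many values near zero, progressively fewer large ones) and an error standard deviation of 0.5; both are illustrative choices, only $\beta = 1$ comes from the text above:

```r
set.seed(2)                       # arbitrary seed
n <- 5000
x <- rexp(n, rate = 1)            # right-skewed predictor (assumption)
y <- 1 * x + rnorm(n, sd = 0.5)   # beta = 1; normal errors, sd = 0.5 (assumption)

fit <- lm(y ~ x)
res <- residuals(fit)

# Sample skewness: a rough measure of asymmetry (0 for a symmetric distribution)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
c(y = skewness(y), residuals = skewness(res))

op <- par(mfrow = c(1, 2))
hist(y, breaks = 50, main = "y (marginal)")
hist(res, breaks = 50, main = "residuals")
par(op)
```

The marginal histogram of `y` should come out right-skewed while the residual histogram looks symmetric and bell-shaped, which is the distinction between the marginal and the conditional distribution of the response.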

Regression – Handling OLS Residuals Not Normally Distributed

The ordinary least squares estimate is still a reasonable estimator in the face of non-normal errors. In particular, the Gauss-Markov Theorem states that the ordinary least squares estimate is the best linear unbiased estimator (BLUE) of the regression coefficients ('best' meaning smallest variance, and hence smallest mean squared error, among all linear unbiased estimators) as long as the errors

(1) have mean zero

(2) are uncorrelated

(3) have constant variance

Notice there is no condition of normality here (or even any condition that the errors are IID).
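A small sketch of errors that satisfy (1) to (3) without being normal or IID. The design, the two error distributions, and the true coefficients are all made up for illustration. It checks unbiasedness and the usual variance formula by simulation; it does not demonstrate the optimality part of the theorem.

```r
set.seed(3)                        # arbitrary seed
n <- 200; reps <- 2000             # illustrative sizes
x <- seq(0, 10, length.out = n)    # fixed design, reused in every replication

slopes <- replicate(reps, {
  # Two different error distributions, both with mean 0 and variance 1,
  # so the errors are independent but not identically distributed
  e_unif <- runif(n, -sqrt(3), sqrt(3))
  e_exp  <- rexp(n, rate = 1) - 1
  eps <- ifelse(seq_len(n) %% 2 == 0, e_unif, e_exp)
  y <- 1 + 2 * x + eps
  coef(lm(y ~ x))[2]
})

mean(slopes)                       # should be close to the true slope of 2
sd(slopes)                         # simulated standard error of the slope
theory_se <- 1 / sqrt(sum((x - mean(x))^2))   # sigma / sqrt(Sxx), sigma = 1
theory_se
```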

The normality condition comes into play when you're trying to get confidence intervals and/or $p$-values. As @MichaelChernick mentions (+1, btw) you can use robust inference when the errors are non-normal as long as the departure from normality can be handled by the method. For example, (as we discussed in this thread) the Huber $M$-estimator can provide robust inference when the true error distribution is a mixture of a normal and a long-tailed distribution (which your example looks like) but may not be helpful for other departures from normality. One interesting possibility that Michael alludes to is bootstrapping to obtain confidence intervals for the OLS estimates and seeing how this compares with the Huber-based inference.
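A minimal sketch of the bootstrap idea in base R, on simulated data rather than the OP's (the $t_3$ errors, sample size, and number of resamples are assumptions). It uses a case-resampling bootstrap with a percentile interval, which is only one of several bootstrap schemes, and it does not cover the Huber-based comparison:

```r
set.seed(4)                          # arbitrary seed
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rt(n, df = 3)       # long-tailed errors; df = 3 is an assumption

fit <- lm(y ~ x)

# Case (pairs) bootstrap: resample rows with replacement and refit
B <- 2000
boot_slopes <- replicate(B, {
  idx <- sample.int(n, replace = TRUE)
  coef(lm(y[idx] ~ x[idx]))[2]
})

boot_ci <- quantile(boot_slopes, c(0.025, 0.975))   # percentile interval
boot_ci
confint(fit)["x", ]                  # normal-theory interval, for comparison
```

If the two intervals differ noticeably, that is a hint that the normal-theory interval should not be taken at face value for these data.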

Edit: I often hear it said that you can rely on the Central Limit Theorem to take care of non-normal errors. This is not always true (and I'm not just talking about counterexamples where the theorem fails). In the real data example the OP refers to, we have a large sample size but can see evidence of a long-tailed error distribution. In situations where you have long-tailed errors, you can't necessarily rely on the Central Limit Theorem to give you approximately unbiased inference for realistic finite sample sizes. For example, if the errors follow a $t$-distribution with $2.01$ degrees of freedom (which is not clearly more long-tailed than the errors seen in the OP's data), the coefficient estimates are asymptotically normally distributed, but the normal approximation takes much longer to "kick in" than it does for shorter-tailed distributions.

Below, I demonstrate with a crude simulation in R that when $y_{i} = 1 + 2x_{i} + \varepsilon_i$, where $\varepsilon_{i} \sim t_{2.01}$, the sampling distribution of $\hat{\beta}_{1}$ is still quite long tailed even when the sample size is $n=4000$:

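set.seed(5678)
B <- matrix(0, 1000, 2)            # one row of (intercept, slope) per replication
for (i in 1:1000) {
    x <- rnorm(4000)
    y <- 1 + 2*x + rt(4000, 2.01)  # errors from a t-distribution with 2.01 df
    g <- lm(y ~ x)
    B[i, ] <- coef(g)
}
qqnorm(B[, 2])                     # normal Q-Q plot of the 1000 slope estimates
qqline(B[, 2])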

