Solved – Heavy-tailed residuals for OLS regression with large n. Implications

least squares, multiple regression, qq-plot, regression, residuals

I am trying to fit a multiple regression on a dataset with n=8619.

First of all, using an untransformed Y as the response variable (i.e. Y = b0 + b1*X1 + b2*X2 + ...) resulted in a residual plot with increasing error variance.

I then tried transforming Y to sqrt(Y) which made the residual plot look better.

However, the residuals still exhibit a wide-tailed distribution (see QQ plot below).
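For readers who want to reproduce this kind of check numerically, here is a minimal sketch of the quantile pairs behind a normal QQ plot. The residuals are simulated from a Student-t distribution with 3 degrees of freedom purely as a hypothetical stand-in for heavy-tailed residuals; `student_t` and `normal_qq_points` are illustrative helpers, not part of the original analysis.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def student_t(rng, df):
    # Student-t draw: standard normal divided by sqrt(chi-square / df).
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) = Gamma(shape=df/2, scale=2)
    return z / math.sqrt(chi2 / df)

def normal_qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal QQ plot."""
    n = len(residuals)
    mu, sigma = mean(residuals), stdev(residuals)
    sample = sorted((r - mu) / sigma for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n keep the probabilities strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

rng = random.Random(0)
# Hypothetical residuals (assumption): t with 3 df, same n as in the question.
residuals = [student_t(rng, 3) for _ in range(8619)]
points = normal_qq_points(residuals)

# Heavy tails show up as extreme sample quantiles far beyond the normal ones.
for theo, samp in points[:3] + points[-3:]:
    print(f"theoretical {theo:7.3f}   sample {samp:8.3f}")
```

If the points in the middle lie close to the 45-degree line while the ends bend away from it, the residuals have heavier tails than a normal distribution.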

My question is: to what extent does this affect the validity of the model? I am aware that non-normal residuals and non-constant variance can make p-values and standard errors inaccurate, but if I recall correctly, the inaccuracy is much more pronounced in small samples.

Is my sample size (n=8619) large enough for the inference to be robust to such a heavy-tailed residual distribution?

Thanks.

[Plots: residual plot and QQ plot of the residuals]

Best Answer

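OLS doesn't require normal errors to estimate the coefficients, as you noted. In large samples you can usually apply the CLT (central limit theorem): provided the errors have finite variance, the coefficient estimates are approximately normal, and that is what justifies the usual p-values.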
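A small simulation can make the large-sample claim concrete. The sketch below is only an illustration under stated assumptions: a simple regression with one predictor, errors drawn from a Student-t distribution with 3 degrees of freedom (heavy-tailed but with finite variance), and a smaller n than in the question to keep the run short. It counts how often the usual 95% interval for the slope covers the true value.

```python
import math
import random

def student_t(rng, df):
    # Student-t draw: standard normal divided by sqrt(chi-square / df).
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) = Gamma(shape=df/2, scale=2)
    return z / math.sqrt(chi2 / df)

def slope_ci_covers(rng, n, true_slope, df):
    """Fit y = a + b*x by OLS and report whether the 95% CI for b covers true_slope."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1.0 + true_slope * x + student_t(rng, df) for x in xs]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)  # classical OLS standard error of the slope
    return abs(slope - true_slope) <= 1.96 * se

rng = random.Random(1)
reps = 300   # number of simulated datasets (assumption)
n = 1000     # smaller than the question's 8619, to keep the run short
coverage = sum(slope_ci_covers(rng, n, 2.0, 3) for _ in range(reps)) / reps
print(f"empirical coverage of the nominal 95% interval: {coverage:.3f}")
```

Coverage close to the nominal level is what the CLT argument predicts here. The simulation says nothing about errors with infinite variance, which is the case the next paragraph is concerned with.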

The problem with fat tails is that they may come from a distribution to which the CLT does not apply. For instance, there is a family of distributions called stable distributions. Under the CLT, a suitably normalized sum of random variables with finite variance converges to a normal distribution. A sum of independent stable random variables, by contrast, follows a stable distribution of the same type regardless of the sample size, whether it's 30 or 8,000. The non-normal stable distributions have other nasty properties: they all have infinite variance, and some (the Cauchy, for example) have no mean either, which will make the coefficient variance-covariance calculation "interesting".
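To illustrate the point about stable distributions, here is a minimal sketch comparing sample means of normal draws with sample means of standard Cauchy draws (the Cauchy is one member of the stable family). The replication count and the helper names `cauchy` and `iqr_of_sample_means` are assumptions made for the illustration. The spread of the sample mean is measured by its interquartile range across replications, because the Cauchy has no variance to estimate.

```python
import math
import random
from statistics import fmean, quantiles

def cauchy(rng):
    # Inverse-CDF draw from the standard Cauchy distribution.
    return math.tan(math.pi * (rng.random() - 0.5))

def normal(rng):
    return rng.gauss(0.0, 1.0)

def iqr_of_sample_means(draw, n, reps, rng):
    """Interquartile range of the sample mean over `reps` samples of size n."""
    means = [fmean([draw(rng) for _ in range(n)]) for _ in range(reps)]
    q1, _, q3 = quantiles(means, n=4)
    return q3 - q1

rng = random.Random(42)
reps = 200  # replications per setting (assumption)
results = {}
for n in (30, 8000):
    results[n] = (
        iqr_of_sample_means(normal, n, reps, rng),
        iqr_of_sample_means(cauchy, n, reps, rng),
    )
    print(f"n={n:5d}  IQR of mean, normal: {results[n][0]:.4f}  Cauchy: {results[n][1]:.4f}")
```

For the normal draws the spread of the mean shrinks as n grows, while for the Cauchy draws it stays roughly the same at n=30 and n=8000, because the mean of n independent standard Cauchy variables is itself standard Cauchy.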

So, unfortunately, with heavy tails I can't tell you not to worry just because your sample is large. You should examine your errors more closely in this case.
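One possible way to look more closely, which the answer itself does not prescribe, is to estimate how fast the tails of the residuals decay. The sketch below uses the Hill estimator of the tail index alpha on the largest absolute residuals; an estimate clearly above 2 is consistent with finite variance, while an estimate near or below 2 is a warning sign. The simulated residuals and the choice k=200 are assumptions for illustration only, and the estimate is known to be sensitive to k, so it is worth trying several values.

```python
import math
import random

def student_t(rng, df):
    # Student-t draw: standard normal divided by sqrt(chi-square / df).
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) = Gamma(shape=df/2, scale=2)
    return z / math.sqrt(chi2 / df)

def hill_tail_index(values, k):
    """Hill estimate of the tail index alpha from the k largest absolute values."""
    magnitudes = sorted((abs(v) for v in values), reverse=True)
    threshold = magnitudes[k]  # the (k+1)-th largest magnitude
    mean_log_excess = sum(math.log(m / threshold) for m in magnitudes[:k]) / k
    return 1.0 / mean_log_excess

rng = random.Random(0)
n = 8619  # same n as in the question
k = 200   # number of tail observations used (hypothetical choice)

# Hypothetical residuals: t with 3 df (finite variance) and t with 1 df (Cauchy).
finite_variance = [student_t(rng, 3) for _ in range(n)]
cauchy_like = [student_t(rng, 1) for _ in range(n)]

alpha_t3 = hill_tail_index(finite_variance, k)
alpha_cauchy = hill_tail_index(cauchy_like, k)
print(f"Hill estimate, t(3) residuals:  {alpha_t3:.2f}")
print(f"Hill estimate, Cauchy residuals: {alpha_cauchy:.2f}")
```

With real data, `finite_variance` would be replaced by the residuals from the fitted model.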