Normally distributed errors – Why not use the observed residual histogram?

econometrics, histogram, regression

In the context of the classical linear regression model (with all standard assumptions), we know that when the error term is normally distributed, least squares is minimum variance among all unbiased estimators (not only the linear ones), and we know the exact distribution of the t statistics.

However, if the residual histogram suggests that the error term is something other than normal, then the t statistics will not have the claimed t distribution (nor will the F statistic be F distributed). Surely it does not make sense to assume normality if the residuals suggest otherwise. Instead, if we want to know the theoretical distribution of the t statistic, why not assume that the population error distribution matches what we observe in the residual histogram?

Best Answer

The Central Limit Theorem applies in this case. The coefficient estimates are weighted sums of the errors, so even if the errors are not normally distributed, a large enough sample makes the t statistics approximately t distributed (and the F statistic approximately F distributed). How good the approximation is depends on how far the error distribution is from the normal and how large the sample size is. Many regression problems have a combination that makes the approximation reasonable.
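A small simulation can make the CLT argument concrete. The sketch below is illustrative only: the sample size, the centered exponential error distribution, the number of replications, and the function names are all assumptions chosen for the example, not part of the original answer. It draws strongly skewed errors for a simple regression whose true slope is zero and counts how often the usual two-sided 5% t test rejects. The critical value 2.0106 is the 97.5% quantile of the t distribution with 48 degrees of freedom.

```python
import math
import random

def slope_t_stat(x, y):
    """t statistic for H0: slope = 0 in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return slope / se

def rejection_rate(n=50, reps=2000, crit=2.0106, seed=0):
    """Share of simulated samples in which |t| exceeds the critical value."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 1) for _ in range(n)]  # design held fixed across replications
    rejections = 0
    for _ in range(reps):
        # Errors are exponential(1) minus 1: mean zero, strongly right-skewed.
        y = [1.0 + (rng.expovariate(1.0) - 1.0) for _ in range(n)]
        if abs(slope_t_stat(x, y)) > crit:
            rejections += 1
    return rejections / reps

rate = rejection_rate()
print(f"empirical rejection rate at nominal 5%: {rate:.3f}")
```

If the approximation is working, the printed rate should sit near 0.05. Shrinking `n` or using a heavier-tailed error distribution is one way to see where it starts to break down.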

If there is reason to believe the errors follow a different distribution, there are methods for fitting regression models under that assumption. GLMs can fit binomial, Poisson, and gamma distributed responses, and maximum likelihood or Bayesian methods (among others) let you fit other distributions.
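As one illustration of fitting under a non-normal assumption, here is a minimal sketch of Poisson regression with a log link, fitted by maximum likelihood using Newton's method. Everything specific in it is an assumption made for the example: the simulated data, the "true" coefficients 0.5 and 0.8, and the helper names `poisson_draw` and `fit_poisson`. In practice a statistics package would do this fitting; the hand-rolled version only shows what the likelihood-based fit involves.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method; fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def fit_poisson(x, y, iters=25):
    """Maximum likelihood for log E[y] = b0 + b1 * x via Newton's method."""
    b0 = math.log(max(sum(y) / len(y), 1e-8))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information (negative Hessian), a 2x2 symmetric matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated count data with assumed true coefficients 0.5 and 0.8.
rng = random.Random(1)
x = [rng.uniform(0, 2) for _ in range(500)]
y = [poisson_draw(rng, math.exp(0.5 + 0.8 * xi)) for xi in x]

b0_hat, b1_hat = fit_poisson(x, y)
print(f"intercept: {b0_hat:.3f}, slope: {b1_hat:.3f}")
```

The estimates should land close to the assumed values, but only because the data really were generated from a Poisson model. The fit says nothing about whether that model is appropriate for a given data set.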

But if you are unwilling to assume normality, how can you be sure of any other distribution? Sometimes it is clear, but if the residuals look like they might be gamma distributed and you are not sure, then fitting under a normal assumption may be just as good (because of the CLT) as fitting a gamma model that does not actually fit.

If you don't want to make assumptions about the distribution of the residuals, then there are options like permutation tests or bootstrapping (or other non-parametric regression tools), but all of these have their own sets of assumptions and conditions where they may work better or worse.
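A permutation test for a regression slope can be sketched in a few lines. This is a minimal illustration, not a recommended procedure: the simulated data, the true slope of 3, and the function names are assumptions made for the example. The test also carries its own assumption, namely that under the null hypothesis the responses are exchangeable with respect to the predictor.

```python
import random

def ols_slope(x, y):
    """Least squares slope of y on x (simple regression)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

def permutation_p_value(x, y, reps=2000, seed=0):
    """Two-sided permutation p-value for H0: no association between x and y."""
    rng = random.Random(seed)
    observed = abs(ols_slope(x, y))
    y_perm = list(y)
    count = 0
    for _ in range(reps):
        rng.shuffle(y_perm)  # break any link between x and y
        if abs(ols_slope(x, y_perm)) >= observed:
            count += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (count + 1) / (reps + 1)

# Simulated data: assumed true slope of 3 with skewed, mean-zero errors.
rng = random.Random(42)
x = [rng.uniform(0, 1) for _ in range(40)]
y = [3.0 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]

p_value = permutation_p_value(x, y)
print(f"permutation p-value: {p_value:.4f}")
```

No distribution is assumed for the errors here, but the test answers a narrower question (is there any association?) and does not extend as easily to models with several predictors.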

In the end, what matters most is the question you are trying to answer and what you know about the science that produced the data.