Suppose we ran a simple linear regression $y=\beta_0+\beta_1x+u$, saved the residuals $\hat{u}_i$, and drew a histogram of their distribution. If we get something that looks like a familiar distribution, can we assume that our error term has this distribution? Say, if we find that the residuals resemble a normal distribution, does it make sense to assume normality of the error term in the population? I think it is sensible, but how can it be justified?
Linear Regression – Confirming the Distribution of Residuals in R
Tags: r, regression, residuals
Related Solutions
The Central Limit Theorem applies in this case. If the residuals are not normally distributed, but the sample size is large enough, then the t statistics will be approximately t-distributed (and the F statistic will be approximately F distributed). How good the approximation is depends on how different the residuals are from the normal and how large the sample size is. Many regression problems have a combination that makes the approximation reasonable.
If there is reason to believe a different distribution, then there are methods to fit regression models using that assumption. GLMs can fit binomial, Poisson, and gamma-distributed responses, and maximum likelihood or Bayesian methods (among others) allow you to fit still other distributions.
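As a minimal sketch of that alternative (the data and model here are my own illustration, not from the original answer), fitting a Poisson GLM in R alongside an ordinary least-squares fit:

```r
# Sketch: fitting a GLM when a non-normal response distribution is assumed.
# Data simulated for illustration only.
set.seed(1)
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))  # Poisson counts

fit_lm  <- lm(y ~ x)                                   # normal-theory fit
fit_glm <- glm(y ~ x, family = poisson(link = "log"))  # Poisson GLM

coef(fit_glm)  # slope estimate should be near the true value, 1.2
```

The `family` argument is what encodes the distributional assumption; swapping in `binomial()` or `Gamma()` handles the other cases mentioned above.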
But if you are unwilling to assume normality, how can you be sure of other distributions? Sometimes it is clear, but if the residuals look like they might be gamma-distributed and you are not sure, then fitting based on a normal may be just as good (because of the CLT) as fitting a gamma that does not actually fit.
If you don't want to make assumptions about the distribution of the residuals then there are options like permutation tests or bootstrapping (or other non-parametric regression tools), but all of these have their own sets of assumptions and conditions where they may work better or worse.
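For instance, a case-resampling bootstrap for the slope can be sketched as follows (the resampling scheme, error distribution, and replication count are illustrative choices, not prescriptions):

```r
# Sketch: case-resampling bootstrap for the slope of a simple regression,
# making no distributional assumption about the errors.
set.seed(2)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + (rexp(n) - 1)  # skewed (centered exponential) errors

B <- 999
boot_slopes <- replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)  # resample (x, y) pairs
  coef(lm(y[idx] ~ x[idx]))[2]
})
quantile(boot_slopes, c(0.025, 0.975))  # percentile confidence interval
```

Note the assumption baked into case resampling: the $(x_i, y_i)$ pairs are treated as i.i.d. draws, which is one of the "conditions where they may work better or worse" alluded to above.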
In the end, the most important things are the question you are trying to answer and what you know about the science that produced the data.
It seems that you're confused about the relation of the sample size to the application of the CLT. The distribution of $\epsilon_{it}$ has nothing to do with the sample size. I'm assuming that the subscript $i$ refers to the subject (a person) and the subscript $t$ to the time of the observation.
In a simple linear regression we don't make many assumptions about $\epsilon$ in order to estimate $\beta_i$. The errors don't have to be normal, and with increasing sample size they will not tend to become normal.
The CLT is applied in two different ways:
- When the sample size increases, the distribution of an estimate of $\beta_i$, often denoted $\hat{\beta}_i$, will tend to become normal, i.e. $\hat{\beta}_i\sim\mathcal{N}(\beta_i,\sigma^2_\beta)$, where $\sigma^2_\beta$ is a function of $\sigma^2$. Again, we do not require $\epsilon_{it}\sim\mathcal{N}(0,\sigma^2)$; we only need $\operatorname{var}[\epsilon_{it}]=\sigma^2$ for this. This is one of the large-sample properties of linear regression.
- Often, when we deal with physical experiments, one can argue that there are many sources of error, and when they all add up they make $\epsilon_{it}$, the noise in a single observation, normally distributed. This is not related to the sample size; it is simply many sources of error influencing a single observation. In this case we often make the reasonable assumption $\epsilon_{it}\sim\mathcal{N}(0,\sigma^2)$.
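The first point is easy to see by simulation (a sketch with an illustrative skewed error distribution): even when the errors are far from normal, the sampling distribution of $\hat{\beta}_1$ is close to normal for moderate $n$.

```r
# Sketch: sampling distribution of the slope under strongly skewed errors.
set.seed(3)
n <- 200; reps <- 2000
x <- runif(n)  # fixed design across replications

slopes <- replicate(reps, {
  eps <- rexp(n) - 1        # exponential errors, centered but skewed
  y <- 1 + 2 * x + eps
  coef(lm(y ~ x))[2]
})

mean(slopes)  # close to the true slope, 2
sd(slopes)    # and a histogram of `slopes` looks approximately normal
```

Reducing `n` to, say, 5 in the same sketch shows the skewness of the errors leaking into the distribution of the estimate, which is the sample-size dependence the CLT argument is actually about.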
Best Answer
It all depends on how you estimate the parameters. Usually, the estimators are linear, which implies the residuals are linear functions of the data. When the errors $u_i$ have a Normal distribution, then so do the data, whence so do the residuals $\hat{u}_i$ ($i$ indexes the data cases, of course).
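That linearity can be verified directly: for least squares, $\hat{u} = (I - H)y$, where $H = X(X^\top X)^{-1}X^\top$ is the hat matrix. A small R check (the simulated data and variable names are mine):

```r
# Sketch: residuals are a fixed linear function of the response vector y.
set.seed(4)
n <- 20
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

X <- cbind(1, x)                       # design matrix with intercept
H <- X %*% solve(t(X) %*% X) %*% t(X)  # hat (projection) matrix
r_manual <- drop((diag(n) - H) %*% y)  # residuals as (I - H) y
r_lm     <- resid(lm(y ~ x))

max(abs(r_manual - r_lm))              # essentially zero
```

Since $(I-H)$ does not depend on $y$, normal errors give normal data and hence normal residuals, exactly as stated.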
It's conceivable (and logically possible) that when the residuals appear to have approximately a Normal (univariate) distribution, this arises from non-Normal distributions of errors. However, with least squares (or maximum likelihood) techniques of estimation, the linear transformation to compute the residuals is "mild" in the sense that the characteristic function of the (multivariate) distribution of the residuals cannot differ much from the cf of the errors.
In practice, we never need the errors to be exactly Normally distributed, so this is an unimportant issue. Of much greater import for the errors is that (1) their expectations should all be close to zero; (2) their correlations should be low; and (3) there should be an acceptably small number of outlying values. To check these, we apply various goodness-of-fit tests, correlation tests, and tests of outliers (respectively) to the residuals. Careful regression modeling always includes running such tests (which include various graphical visualizations of the residuals, such as those supplied automatically by R's plot method when applied to an lm object).

Another way to get at this question is by simulating from the hypothesized model. Here is some (minimal, one-off) R code to do the job:

For the case $n=32$, this overlaid probability plot of 99 sets of residuals shows they tend to be close to the error distribution (which is standard normal), because they uniformly cleave to the reference line $y=x$:
For the case n=6, the smaller median slope in the probability plots hints that the residuals have a slightly smaller variance than the errors, but overall they tend to be normally distributed, because most of them track the reference line sufficiently well (given the small value of $n$):
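The R code referenced above does not appear in this copy of the answer; the following is a reconstruction sketch of a simulation matching the description (99 replications, standard normal errors, overlaid normal probability plots) — the seed, model, and plotting choices are my assumptions, not the original code.

```r
# Sketch: overlay normal probability plots of residuals from 99 simulated
# regressions with standard normal errors, against the reference line y = x.
set.seed(17)
n <- 32                         # rerun with n <- 6 for the second figure
x <- rnorm(n)

plot(c(-3, 3), c(-3, 3), type = "n",
     xlab = "Normal quantiles", ylab = "Ordered residuals")
abline(0, 1, col = "red")       # reference line y = x

for (i in 1:99) {
  y <- 1 + 2 * x + rnorm(n)     # standard normal errors
  r <- resid(lm(y ~ x))
  q <- qqnorm(r, plot.it = FALSE)
  lines(sort(q$x), sort(q$y), col = adjustcolor("black", alpha.f = 0.2))
}
```

With `n <- 6` the overlaid traces fan out and sit at a slightly shallower median slope, consistent with the residual variance $(1-h_{ii})\sigma^2$ being a bit smaller than the error variance.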