Linear Regression – Confirming the Distribution of Residuals in R

r, regression, residuals

Suppose we ran a simple linear regression $y=\beta_0+\beta_1x+u$, saved the residuals $\hat{u}_i$, and drew a histogram of their distribution. If we get something that looks like a familiar distribution, can we assume that our error term has this distribution? Say, if we find that the residuals resemble a normal distribution, does it make sense to assume normality of the error term in the population? I think it is sensible, but how can it be justified?

Best Answer

It all depends on how you estimate the parameters. Usually, the estimators are linear, which implies the residuals are linear functions of the data. When the errors $u_i$ have a Normal distribution, then so do the data, whence so do the residuals $\hat{u}_i$ ($i$ indexes the data cases, of course).
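To make the linearity explicit (a standard least-squares identity, added here for completeness): with design matrix $X$ and hat matrix $H = X(X^\top X)^{-1}X^\top$, the OLS residuals are

$$\hat{u} = y - X\hat{\beta} = (I - H)\,y = (I - H)\,u,$$

because $(I - H)X\beta = 0$. Jointly Normal errors $u$ therefore yield jointly Normal residuals $\hat{u}$.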

It's conceivable (and logically possible) that when the residuals appear to have approximately a Normal (univariate) distribution, that this arises from non-Normal distributions of errors. However, with least squares (or maximum likelihood) techniques of estimation, the linear transformation to compute the residuals is "mild" in the sense that the characteristic function of the (multivariate) distribution of the residuals cannot differ much from the cf of the errors.
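To see this numerically, here is a minimal one-off sketch (not part of the simulation below; the skewed error distribution is chosen arbitrarily) comparing the quantiles of the residuals with the quantiles of the errors that generated them:

# Compare residual quantiles to the quantiles of strongly skewed errors.
# The residuals track the error distribution rather than becoming Normal.
set.seed(17)
n <- 100
x <- 1:n
u <- rexp(n) - 1                       # skewed errors, centered at zero
y <- 1 + 2*x + u
r <- residuals(lm(y ~ x))
qqplot(u, r, xlab="Error quantiles", ylab="Residual quantiles")
abline(a=0, b=1, col="blue")           # points hug this line: residuals ~ errors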

In practice, we never need the errors to be exactly Normally distributed, so this is an unimportant issue. Of much greater import for the errors is that (1) their expectations should all be close to zero; (2) their correlations should be low; and (3) there should be an acceptably small number of outlying values. To check these, we apply various goodness-of-fit tests, correlation tests, and outlier tests (respectively) to the residuals. Careful regression modeling always includes running such tests (which include various graphical visualizations of the residuals, such as those supplied automatically by R's plot method when applied to an object of class lm).
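For instance, a minimal sketch of those built-in diagnostics (the simulated data here are purely illustrative):

# Fit a toy regression and show R's four default diagnostic plots for an lm object.
set.seed(17)
x <- 1:50
y <- 1 + 2*x + rnorm(50)
model <- lm(y ~ x)
par(mfrow=c(2, 2))   # arrange the four plots in a 2x2 grid
plot(model)          # residuals vs fitted, Normal Q-Q, scale-location, residuals vs leverage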

Another way to get at this question is by simulating from the hypothesized model. Here is some (minimal, one-off) R code to do the job:

# Simulate y = b0 + b1*x + u and draw a normal probability plot of the residuals.
# (b0=1, b1=2, u ~ Normal(0,1) are hard-coded for this example.)
f <- function(n) {                       # n is the number of data points to simulate
    x <- 1:n
    y <- 1 + 2*x + rnorm(n)
    model <- lm(y ~ x)
    # Add this simulation's probability plot: sorted residuals vs. Normal quantiles
    lines(qnorm(((1:n) - 1/2)/n), sort(model$residuals), col="gray")
}
#
# Apply the simulation repeatedly to see what's happening in the long run.
#
n <- 6    # Number of points in each simulated dataset (use 32 or 6 to reproduce the figures below)
plot(qnorm(((1:n) - 1/2)/n), seq(from=-3, to=3, length.out=n),
    type="n", xlab="x", ylab="Residual")  # Create an empty plot with suitable axes
out <- replicate(99, f(n))                # Overlay lots of probability plots
abline(a=0, b=1, col="blue")              # Draw the reference line y=x

For the case n=32, this overlaid probability plot of 99 sets of residuals shows they tend to be close to the error distribution (which is standard normal), because they uniformly cleave to the reference line $y=x$:

Figure for n=32

For the case n=6, the smaller median slope in the probability plots hints that the residuals have a slightly smaller variance than the errors, but overall they tend to be normally distributed, because most of them track the reference line sufficiently well (given the small value of $n$):

Figure for n=6