Linear Regression Assumption: Normality of residuals vs normality of variables

linear regression, normal distribution

I have read in many places, including Stack Exchange, that in order to carry out a linear regression analysis the residuals have to be normal, because many of the statistical results, parameter estimates, and prediction intervals rely on the normality assumption. Thus, we run appropriate tests to determine whether the residuals are normal and, if they are not, we can either transform the observed data (or the residuals?) or collect more samples so that $N > 30$. PLEASE CORRECT ME IF I'M WRONG ON ANY OF THIS.
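For concreteness, here is a minimal sketch of what I mean by "testing the residuals" (simulated data, using numpy and scipy; the variable names are just placeholders for illustration):

```python
# Minimal sketch: fit a simple linear model and test the *residuals*
# for normality, rather than the raw variables x or y.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)            # predictor need not be normal
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)   # normal errors by construction

# Ordinary least squares fit
fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# Shapiro-Wilk test applied to the residuals, not to x or y themselves
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk on residuals: W = {stat:.3f}, p = {p_value:.3f}")
```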

However, I have also read in other places that for most models, including linear regression, one also usually assumes normality of the variables themselves, for two main reasons (Toby Mordkoff, 2016):

1) "To prevent us from having to use one set of statistical procedures for large ($30+$) samples and another set of procedures for smaller samples… we assume that the population is normal"

2) "Therefore, if we are going to assume that our estimates of the population mean and variance are independent (in order to simplify the mathematics involved, as we do), and we are going to use the sample mean and the sample variance to make these estimates (as we do), then we need the sample mean and sample variance to be independent. The only distribution for which this is true is the normal. Therefore, we assume that populations are normal."

So should the independent (or dependent) variables in a linear regression model be normal, or just the residuals? Please explain.

Best Answer

Linear regression expresses a relationship between a response and covariates that is linear in the coefficients. In the simplest case it relates a one-dimensional response $Y$ to a one-dimensional $X$ as follows:

$ Y = \beta_0 + \beta_1 X + \epsilon$,

where $Y, X$ and $\epsilon$ are considered as random variables and $\beta_0, \beta_1$ are coefficients (model parameters) to be estimated.

Being a model for the conditional mean, the regression specifies:

$E[ Y|X ] = \beta_0 + \beta_1 X$ with an implied assumption that

$E[\epsilon \mid X] = 0$ and also $\operatorname{Var}(\epsilon \mid X) = \text{constant}$.

Thus, model restrictions are placed only on the conditional distribution of $\epsilon$ given $X$, or equivalently on $Y$ given $X$.

A convenient distribution used for residuals ($\epsilon$) is Normal/Gaussian, but the regression model, in general, works with other distributions as well.
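As a small illustration of that point (a simulated sketch, not part of the original argument), ordinary least squares still recovers the coefficients when the errors are markedly non-normal:

```python
# Sketch: least squares recovers the coefficients even with skewed,
# non-Gaussian errors (here, centered exponential noise).
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.uniform(0, 10, size=n)
eps = rng.exponential(scale=2.0, size=n) - 2.0   # mean-zero but skewed
y = 1.0 + 3.0 * x + eps

X = np.column_stack([np.ones(n), x])             # intercept column plus x
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                  # close to [1.0, 3.0]
```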

Not to confuse things further here, but it should still be noted that regression analysis doesn't have to make any distributional assumptions at all. To estimate the coefficients, for example, we use the least squares method without reference to any distribution. For more complex analyses, however, statisticians use various probability distributions to specify models, make assumptions explicit, and use probability theory to justify the results.
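To make that concrete, here is a sketch (with made-up numbers) showing that the least-squares coefficients are pure linear algebra, $(X^\top X)^{-1} X^\top y$, computed with no probability distribution in sight:

```python
# Sketch: least-squares coefficients from the normal equations alone.
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])           # column of ones for the intercept, then x
y = np.array([2.1, 3.9, 6.2, 7.8])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                      # [intercept, slope]
```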

To learn the basics of statistical science I'd look at books written by statisticians. "Applied Linear Regression" by Weisberg, among many others, is a good starting point and reference.
