Solved – Regression residual distribution assumptions

assumptions, normal distribution, notation, regression, residuals

Why is it necessary to place the distributional assumption on the errors, i.e.

$y_i = x_i^\top\beta + \epsilon_{i}$, with $\epsilon_{i} \sim \mathcal{N}(0,\sigma^{2})$.

Why not write

$y_i = x_i^\top\beta + \epsilon_{i}$, with $y_i \sim \mathcal{N}(x_i^\top\hat{\beta},\sigma^{2})$,

where in either case $\epsilon_i = y_i - \hat{y}_i$.
I've seen it stressed that the distributional assumptions are placed on the errors, not on the data, but without explanation.

I don't really understand the difference between these two formulations. In some places I see distributional assumptions placed on the data (mostly in the Bayesian literature, it seems), but most of the time the assumptions are placed on the errors.

When modelling, why would or should one choose to begin with assumptions on one rather than the other?

Best Answer

In a linear regression setting it is common to do analysis and derive results conditional on $X$, i.e. conditional on "the data". Thus, what you need is that $y \mid X$ is normal, that is, you need $\epsilon$ to be normal. As Peter Flom's example illustrates, one can have normality of $\epsilon$ without normality of $y$: if $X$ itself is not normal (say, a group indicator), the marginal distribution of $y$ mixes over the values of $X$ and need not be normal at all. Since what you actually need is normality of $\epsilon$, that is the sensible assumption.
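A quick simulation makes this concrete (a sketch in the spirit of the answer; the binary regressor and parameter values are my own illustrative choices, not from the original example). With a 0/1 regressor and a large group difference, the marginal distribution of $y$ is bimodal, yet the errors, and hence the residuals, are normal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: binary regressor, normal errors.
x = rng.integers(0, 2, size=n)          # group indicator, clearly not normal
beta0, beta1, sigma = 0.0, 10.0, 1.0    # assumed true parameters
eps = rng.normal(0.0, sigma, size=n)    # N(0, sigma^2) errors
y = beta0 + beta1 * x + eps             # marginally a 50/50 mixture of
                                        # N(0,1) and N(10,1): bimodal

# OLS fit of y on [1, x]; residuals estimate eps.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

def excess_kurtosis(v):
    """Excess kurtosis: 0 for a normal; strongly negative for this mixture."""
    z = (v - v.mean()) / v.std()
    return (z ** 4).mean() - 3.0

# y is far from normal, while the residuals look normal.
print("excess kurtosis of y:     ", round(excess_kurtosis(y), 2))
print("excess kurtosis of resid: ", round(excess_kurtosis(resid), 2))
```

The point of the design choice: conditional on $X$ (i.e. within each group), $y$ is exactly normal, which is all the standard inference requires; the non-normal shape of the marginal $y$ comes entirely from the distribution of $X$.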
