Solved – Residuals in regression should not be correlated with another variable

correlation, multiple regression, residuals

This Minitab post on checking residuals says:

If you can predict the residuals with another variable, that variable
should be included in the model.

I would have thought that if we can predict the residuals with another variable, then we should not include that variable in the model. It is not clear to me why including the variable makes sense.

Best Answer

This is based on the premise that there exists a set of explanatory variables (EVs) whose variability captures everything in the variability of the dependent variable except "random, unpredictable noise". As the link itself says clearly,

"The idea is that the deterministic portion of your model is so good at explaining (or predicting) the response that only the inherent randomness of any real-world phenomenon remains leftover for the error portion."

So if the residuals $\hat {\mathbf u}$ from the regression of, say, DV $y$ on EVs $\{X_1, X_2\}$ are correlated with some third variable $Z$ (which was not included in the regression), it means that the residuals do not behave like "random, unpredictable noise". The set of EVs that was used is then not the set that "captures everything in the variability of $y$", and so it can be "improved upon".
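To make this concrete before the algebra, here is a minimal simulation sketch. Everything in it (variable names, coefficients, seed) is illustrative, not taken from the question: the data-generating process includes a variable $z$, the regression omits it, and the residuals remain clearly predictable from $z$.

```python
# Illustrative sketch: residuals from a regression that omits z stay
# correlated with z. All names, coefficients and seeds are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
z = rng.normal(size=n)

# The true DGP includes z, but the regression below leaves it out.
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * z + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])   # regressors: constant, x1, x2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# The residuals are predictable from z: the population correlation is
# 0.5 / sqrt(1.25) ~ 0.45, and the sample estimate will be close to it.
print(np.corrcoef(resid, z)[0, 1])
```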

Let's see what this means in the merciless language of mathematics.

We have the model, in matrix notation, for a sample of size $n$:

$$\mathbf y = \mathbf X\beta + \mathbf u$$

and let's assume that the conditions needed for the OLS estimator to triumph do hold:

$$E(\mathbf u \mid \mathbf X) = \mathbf 0, \qquad {\rm Var}(\mathbf u \mid \mathbf X) = \sigma^2 I$$

Note that "the error is random, unpredictable noise" is not part of these assumptions. What the above says is that the error is unpredictable with respect to the specific set of regressors (see the "Conditional Expectation Function" approach to linear regression). Running the estimation, we obtain the OLS estimator

$$\hat \beta = (\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf y = \beta + (\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf u$$

and from it the residuals. The residuals are an estimator of the error, and since

$$u_i = y_i - \mathbf x'_i\beta = \mathbf x'_i \hat \beta +\hat u_i - \mathbf x'_i\beta$$

we can re-arrange to get

$$\hat u_i = u_i - \mathbf x'_i(\hat \beta - \beta),\;\; i=1,...,n$$
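Both the sampling-error form of $\hat \beta$ and this residual identity are exact in any sample, so they are easy to verify numerically. The sketch below uses an illustrative, made-up design purely to check them.

```python
# Minimal numerical check (illustrative setup) of the two identities above:
#   beta_hat = beta + (X'X)^{-1} X'u   and   u_hat = u - X (beta_hat - beta)
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
beta = np.array([1.0, 2.0, -1.0])

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
u = rng.normal(size=n)
y = X @ beta + u

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

assert np.allclose(beta_hat, beta + XtX_inv @ X.T @ u)   # sampling-error form
assert np.allclose(resid, u - X @ (beta_hat - beta))     # residual identity
```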

So, since the residuals by construction have zero mean, their being correlated with some variable $Z$ which is not in $\mathbf X$ means that

$${\rm Cov}(\hat u_i, z_i) = E\Big(z_i\cdot [u_i - \mathbf x'_i(\hat \beta - \beta)]\Big) = E(u_iz_i) - E\Big(z_i\cdot\mathbf x'_i(\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf u\Big) \neq 0$$

Now apply the Law of Iterated Expectations to the second term of the last expression to get

$$E(u_iz_i) - E\Big[E\Big(z_i\cdot\mathbf x'_i(\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf u\mid Z, \mathbf X\Big)\Big] \neq 0$$

$$\implies E(u_iz_i) - E\Big[z_i\cdot\mathbf x'_i(\mathbf X'\mathbf X)^{-1}\mathbf X'E\Big(\mathbf u\mid Z, \mathbf X\Big)\Big] \neq 0$$
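The sample analogue of this decomposition can be checked directly. In the sketch below (an illustrative setup where the error contains $0.5\,z$, so $E(u_i z_i) = 0.5$), the left-hand side and the two-term right-hand side agree up to floating point, and both are far from zero.

```python
# Sample analogue of  Cov(u_hat, z) = E(u z) - E(z * x'(X'X)^{-1} X'u),
# in an illustrative setup where the error is predictable from z.
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

beta = np.array([1.0, 2.0, -1.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
z = rng.normal(size=n)
u = 0.5 * z + rng.normal(size=n)   # E(u z) = 0.5 by construction
y = X @ beta + u

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

lhs = np.mean(resid * z)           # sample analogue of Cov(u_hat, z)
rhs = np.mean(u * z) - np.mean(z * (X @ (XtX_inv @ X.T @ u)))
print(lhs, rhs)                    # identical up to rounding, both near 0.5
```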

The possible scenarios here are:

A) $E(u_iz_i) = 0, \;\;\; E\Big(\mathbf u\mid Z, \mathbf X\Big) \neq 0$

In this case, if you follow the advice given and re-run the regression of $\mathbf y$ using $\{Z, \mathbf X\}$ as regressors, you should know the following: in the attempt to make the residuals behave as "random noise" with respect to variables not in $\mathbf X$ (which is far more ambitious than what we usually assume for our models), you will lose the finite-sample unbiasedness property of the estimator (due to $E\Big(\mathbf u\mid Z, \mathbf X\Big) \neq 0$), but you will at least retain asymptotic consistency (due to $E(u_iz_i) = 0$, and under the assumptions already made). That's a good trade-off: Econometrics has long abandoned the hope of finite-sample unbiasedness (and if you look around this site, you will find that statisticians in general have adopted, or pioneered, the same stance). The sketch below illustrates this bias/consistency trade-off.
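The cleanest textbook case with exactly this structure (contemporaneous orthogonality holds, strict exogeneity fails) is not the $Z$-setup itself but OLS on an AR(1) model, so the Monte Carlo below uses that instead; all numbers are illustrative. The estimator is visibly biased downward in small samples yet converges to the true value.

```python
# Scenario (A) in miniature: in  y_t = rho * y_{t-1} + e_t,  we have
# E(e_t * y_{t-1}) = 0 (so OLS is consistent) but E(e | all regressors) != 0
# (strict exogeneity fails), so OLS is biased in finite samples.
import numpy as np

rng = np.random.default_rng(4)
rho = 0.9

def mean_rho_hat(n, reps=2_000):
    """Monte Carlo mean of the OLS slope in y_t = rho * y_{t-1} + e_t."""
    out = np.empty(reps)
    for r in range(reps):
        e = rng.normal(size=n + 1)
        y = np.empty(n + 1)
        y[0] = e[0] / np.sqrt(1.0 - rho**2)   # stationary start
        for t in range(1, n + 1):
            y[t] = rho * y[t - 1] + e[t]
        out[r] = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
    return out.mean()

for n in (20, 100, 1_000):
    # clearly below 0.9 at n = 20; the gap shrinks as n grows
    print(n, mean_rho_hat(n))
```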

B) $E(u_iz_i) \neq 0, \;\;\; E\Big(\mathbf u\mid Z, \mathbf X\Big) \neq 0$

Here you will lose both unbiasedness and consistency, and your estimator ends up rather poor in good properties, to put it mildly. I am not sure that "making the residuals random" justifies the price to pay; to my eyes it does not, because the estimates of $\beta$ are now very unreliable, and so the "random" residuals become an artificial construct rather than a step closer to the true associations. And what if there is yet another variable $W$ which can still predict the new residuals? The sketch below shows scenario (B) in miniature.
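This is a sketch under illustrative assumptions ($z = x + v$ and $u = 0.5\,v + e$ with $v, e$ independent noise, so that $E(u \mid x) = 0$ but $E(u_i z_i) \neq 0$): the regression without $z$ estimates the true coefficient on $x$ consistently, while adding $z$ pulls it away from the truth even in very large samples.

```python
# Scenario (B) in miniature: the error u is correlated with z, so adding
# z as a regressor destroys consistency. Setup: z = x + v, u = 0.5*v + e.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000                       # large n: we are looking at the plim

x = rng.normal(size=n)
v = rng.normal(size=n)
z = x + v                         # z correlated with x
u = 0.5 * v + rng.normal(size=n)  # E(u | x) = 0, but E(u z) = 0.5 != 0
y = 1.0 + 2.0 * x + u             # true coefficient on x is 2; z has none

short = np.column_stack([np.ones(n), x])
long_ = np.column_stack([np.ones(n), x, z])

b_short, *_ = np.linalg.lstsq(short, y, rcond=None)
b_long, *_ = np.linalg.lstsq(long_, y, rcond=None)

print(b_short)   # ~ [1, 2]        : consistent without z
print(b_long)    # ~ [1, 1.5, 0.5] : with z, the x-coefficient plims to 1.5
```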

So the advice given may send you to better places, or it may send you to much worse places. The critical issue, therefore, is: can you obtain evidence about which of the two scenarios holds in a given case? But this exceeds the scope of this answer.

The lesson I take from all this is that "intuitive discussions" about the "random and deterministic parts of a dependent variable" may be useful to a degree, but somewhere along the road one should remember that the estimators which eventually make our attempts at estimation and inference concrete and tangible are mathematical tools, with specific properties and specific limits to what they can do, achieve, and guarantee. Sometimes they cannot achieve what appears "powerfully logical and intuitive" in a non-mathematical approach to the matter.
