Solved – Residuals in regression should not be correlated with another variable

correlation, multiple regression, residuals

This Minitab post on checking residuals says:

If you can predict the residuals with another variable, that variable
should be included in the model.

I would have thought that if we can predict the residuals with another variable, then we should not include that variable in the model. It is not clear to me why including the variable makes sense.

Best Answer

This is based on the premise that there exists a set of explanatory variables (EVs) whose variability captures everything in the variability of the dependent variable except "random, unpredictable noise". As the link itself says clearly,

"The idea is that the deterministic portion of your model is so good at explaining (or predicting) the response that only the inherent randomness of any real-world phenomenon remains leftover for the error portion."

So if the residuals $\hat {\mathbf u}$ from the regression of, say, DV $y$ on EVs $\{X_1, X_2\}$ are correlated with some third variable $Z$ (which was not included in the regression), it means that the residuals do not behave like "random, unpredictable noise". The set of EVs that was used is then not the set that "captures everything in the variability of $y$", and so it can be "improved upon".
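To make this concrete before the algebra, here is a minimal simulation sketch. Everything in it (variable names, coefficients, seed) is illustrative, not taken from the question: the data-generating process includes a variable $z$, the regression omits it, and the residuals remain clearly predictable from $z$.

```python
# Illustrative sketch: residuals from a regression that omits z stay
# correlated with z. All names, coefficients and seeds are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
z = rng.normal(size=n)

# The true DGP includes z, but the regression below leaves it out.
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * z + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])   # regressors: constant, x1, x2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# The residuals are predictable from z: the population correlation is
# 0.5 / sqrt(1.25) ~ 0.45, and the sample estimate will be close to it.
print(np.corrcoef(resid, z)[0, 1])
```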

Let's see what this means in the merciless language of mathematics.

We have the model, in matrix notation, for a sample of size $n$:

$$\mathbf y = \mathbf X\beta + \mathbf u$$

and let's assume that the conditions needed for the OLS estimator to triumph do hold:

$$E(\mathbf u \mid \mathbf X) = \mathbf 0, \qquad {\rm Var}(\mathbf u \mid \mathbf X) = \sigma^2 I$$

Note that "the error is random, unpredictable noise" is not part of these assumptions. What the above says is that the error is unpredictable with respect to the specific set of regressors (see the "Conditional Expectation Function" approach to linear regression). Running the estimation, we obtain the OLS estimator

$$\hat \beta = (\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf y = \beta + (\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf u$$

and from it the residuals. The residuals are an estimator of the error, and since

$$u_i = y_i - \mathbf x'_i\beta = \mathbf x'_i \hat \beta +\hat u_i - \mathbf x'_i\beta$$

we can re-arrange to get

$$\hat u_i = u_i - \mathbf x'_i(\hat \beta - \beta),\;\; i=1,...,n$$
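Both the sampling-error form of $\hat \beta$ and this residual identity are exact in any sample, so they are easy to verify numerically. The sketch below uses an illustrative, made-up design purely to check them.

```python
# Minimal numerical check (illustrative setup) of the two identities above:
#   beta_hat = beta + (X'X)^{-1} X'u   and   u_hat = u - X (beta_hat - beta)
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
beta = np.array([1.0, 2.0, -1.0])

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
u = rng.normal(size=n)
y = X @ beta + u

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

assert np.allclose(beta_hat, beta + XtX_inv @ X.T @ u)   # sampling-error form
assert np.allclose(resid, u - X @ (beta_hat - beta))     # residual identity
```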

So, since the residuals by construction have zero mean, their being correlated with some variable $Z$ which is not in $\mathbf X$ means that

$${\rm Cov}(\hat u_i, z_i) = E\Big(z_i\cdot [u_i - \mathbf x'_i(\hat \beta - \beta)]\Big) = E(u_iz_i) - E\Big(z_i\cdot\mathbf x'_i(\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf u\Big) \neq 0$$

Now apply the Law of Iterated Expectations to the second term of the last expression to get

$$E(u_iz_i) - E\Big[E\Big(z_i\cdot\mathbf x'_i(\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf u\mid Z, \mathbf X\Big)\Big] \neq 0$$

$$\implies E(u_iz_i) - E\Big[z_i\cdot\mathbf x'_i(\mathbf X'\mathbf X)^{-1}\mathbf X'E\Big(\mathbf u\mid Z, \mathbf X\Big)\Big] \neq 0$$
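The sample analogue of this decomposition can be checked directly. In the sketch below (an illustrative setup where the error contains $0.5\,z$, so $E(u_i z_i) = 0.5$), the left-hand side and the two-term right-hand side agree up to floating point, and both are far from zero.

```python
# Sample analogue of  Cov(u_hat, z) = E(u z) - E(z * x'(X'X)^{-1} X'u),
# in an illustrative setup where the error is predictable from z.
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

beta = np.array([1.0, 2.0, -1.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
z = rng.normal(size=n)
u = 0.5 * z + rng.normal(size=n)   # E(u z) = 0.5 by construction
y = X @ beta + u

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

lhs = np.mean(resid * z)           # sample analogue of Cov(u_hat, z)
rhs = np.mean(u * z) - np.mean(z * (X @ (XtX_inv @ X.T @ u)))
print(lhs, rhs)                    # identical up to rounding, both near 0.5
```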

The possible scenarios here are:

A) $E(u_iz_i) = 0, \;\;\; E\Big(\mathbf u\mid Z, \mathbf X\Big) \neq 0$

In this case, if you follow the advice given and re-run the regression of $\mathbf y$ using $\{Z, \mathbf X\}$ as regressors, you should know the following: in the attempt to make the residuals behave as "random noise" with respect to variables not in $\mathbf X$ (which is far more ambitious than what we usually assume for our models), you will lose the finite-sample unbiasedness property of the estimator (due to $E\Big(\mathbf u\mid Z, \mathbf X\Big) \neq 0$), but you will at least retain asymptotic consistency (due to $E(u_iz_i) = 0$, and under the assumptions already made). That's a good trade-off: Econometrics has long abandoned the hope of finite-sample unbiasedness (and if you look around this site, you will find that statisticians in general have adopted, or pioneered, the same stance). The sketch below illustrates this bias/consistency trade-off.
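The cleanest textbook case with exactly this structure (contemporaneous orthogonality holds, strict exogeneity fails) is not the $Z$-setup itself but OLS on an AR(1) model, so the Monte Carlo below uses that instead; all numbers are illustrative. The estimator is visibly biased downward in small samples yet converges to the true value.

```python
# Scenario (A) in miniature: in  y_t = rho * y_{t-1} + e_t,  we have
# E(e_t * y_{t-1}) = 0 (so OLS is consistent) but E(e | all regressors) != 0
# (strict exogeneity fails), so OLS is biased in finite samples.
import numpy as np

rng = np.random.default_rng(4)
rho = 0.9

def mean_rho_hat(n, reps=2_000):
    """Monte Carlo mean of the OLS slope in y_t = rho * y_{t-1} + e_t."""
    out = np.empty(reps)
    for r in range(reps):
        e = rng.normal(size=n + 1)
        y = np.empty(n + 1)
        y[0] = e[0] / np.sqrt(1.0 - rho**2)   # stationary start
        for t in range(1, n + 1):
            y[t] = rho * y[t - 1] + e[t]
        out[r] = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
    return out.mean()

for n in (20, 100, 1_000):
    # clearly below 0.9 at n = 20; the gap shrinks as n grows
    print(n, mean_rho_hat(n))
```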

B) $E(u_iz_i) \neq 0, \;\;\; E\Big(\mathbf u\mid Z, \mathbf X\Big) \neq 0$

Here you will lose both unbiasedness and consistency, and your estimator ends up rather poor in good properties, to put it mildly. I am not sure that "making the residuals random" justifies the price to pay; to my eyes it does not, because the estimates of $\beta$ are now very unreliable, and so the "random" residuals become an artificial construct rather than a step closer to the true associations. And what if there is yet another variable $W$ which can still predict the new residuals? The sketch below shows scenario (B) in miniature.
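This is a sketch under illustrative assumptions ($z = x + v$ and $u = 0.5\,v + e$ with $v, e$ independent noise, so that $E(u \mid x) = 0$ but $E(u_i z_i) \neq 0$): the regression without $z$ estimates the true coefficient on $x$ consistently, while adding $z$ pulls it away from the truth even in very large samples.

```python
# Scenario (B) in miniature: the error u is correlated with z, so adding
# z as a regressor destroys consistency. Setup: z = x + v, u = 0.5*v + e.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000                       # large n: we are looking at the plim

x = rng.normal(size=n)
v = rng.normal(size=n)
z = x + v                         # z correlated with x
u = 0.5 * v + rng.normal(size=n)  # E(u | x) = 0, but E(u z) = 0.5 != 0
y = 1.0 + 2.0 * x + u             # true coefficient on x is 2; z has none

short = np.column_stack([np.ones(n), x])
long_ = np.column_stack([np.ones(n), x, z])

b_short, *_ = np.linalg.lstsq(short, y, rcond=None)
b_long, *_ = np.linalg.lstsq(long_, y, rcond=None)

print(b_short)   # ~ [1, 2]        : consistent without z
print(b_long)    # ~ [1, 1.5, 0.5] : with z, the x-coefficient plims to 1.5
```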

So the advice given may send you to better places, or it may send you to much worse places. The critical issue, therefore, is: can you obtain evidence about which of the two scenarios holds in a given case? But this exceeds the scope of this answer.

The lesson I take from all this is that "intuitive discussions" about the "random and deterministic parts of a dependent variable" may be useful to a degree, but somewhere along the road one should remember that the estimators which eventually make our attempts at estimation and inference concrete and tangible are mathematical tools, with specific properties and specific limits to what they can do, achieve, and guarantee. Sometimes they cannot achieve what appears "powerfully logical and intuitive" in a non-mathematical approach to the matter.
