Solved – Why doesn’t collinearity affect the predictions

multicollinearity, multiple regression, predictive-models

I have read in many places that collinearity doesn't affect the predictions: it only affects the coefficient tests and confidence intervals. As a result, a collinear model cannot be used for causal inference, but it is 'safe' to use for making predictions. Here there is a proof for the case when the predictors are perfectly correlated. But why is this also the case for imperfectly correlated predictors?

Say we have two predictors with correlation $0.8$: how does including both of them in a multiple regression affect the predictions?
Below is an example with correlation $0.91$ between the two predictors, and there does appear to be some difference between the predictions, especially on the test data. Is it possible to give a formal analysis (like the one for perfect correlation in the link above)?

> x1 <- 1:100 + rnorm(100)
> x2 <- x1^3 - rnorm(100, 50, 50)
> cor(x1, x2)
[1] 0.9179052
> y <- 201:300 + rnorm(100, 0, 20)
> cor(y, x1)
[1] 0.8463703
> cor(y, x2)
[1] 0.7621107
> d <- data.frame(x1= x1, x2= x2, y= y)
> m <- lm(y~x1, data = d[1:50,])
> m2 <- lm(y~x2, data = d[1:50,])
> m3 <- lm(y~x1+x2, data = d[1:50,])


> #on training data
> p1 <- predict(m, d[1:50,])
> p2 <- predict(m2, d[1:50,])
> p3 <- predict(m3, d[1:50,])
> summary(p1 - p2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-10.759  -3.269   1.320   0.000   3.872   5.166 
> summary(p1 - p3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.8312 -0.6590 -0.2200  0.0000  0.5984  1.5835 
> summary(p2 - p3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -5.979  -4.550  -1.755   0.000   3.877  12.052 


> #on test data
> p4 <- predict(m, d[51:100,])
> p5 <- predict(m2, d[51:100,])
> p6 <- predict(m3, d[51:100,])
> summary(p4 - p5)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-232.308 -137.667  -76.297  -90.126  -33.208   -7.692 
> summary(p5 - p6)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  9.709  40.101  91.202 107.497 163.833 275.705 
> summary(p4 - p6)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.016   6.893  14.905  17.371  26.166  43.397 


> #model summaries
> summary(m)

Call:
lm(formula = y ~ x1, data = d[1:50, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-45.424 -10.758   0.179  13.969  31.637 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 204.0189     5.5249  36.927  < 2e-16 ***
x1            0.8227     0.1898   4.335 7.43e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.04 on 48 degrees of freedom
Multiple R-squared:  0.2813,    Adjusted R-squared:  0.2664 
F-statistic: 18.79 on 1 and 48 DF,  p-value: 7.435e-05

> summary(m2)

Call:
lm(formula = y ~ x2, data = d[1:50, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-44.193 -14.038   0.306  15.814  35.632 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.157e+02  3.736e+00  57.725  < 2e-16 ***
x2          2.923e-04  7.808e-05   3.743 0.000486 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.76 on 48 degrees of freedom
Multiple R-squared:  0.226, Adjusted R-squared:  0.2098 
F-statistic: 14.01 on 1 and 48 DF,  p-value: 0.0004857

> summary(m3)

Call:
lm(formula = y ~ x1 + x2, data = d[1:50, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-45.848 -11.305   0.024  13.404  32.701 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.026e+02  7.700e+00  26.310   <2e-16 ***
x1           9.439e-01  4.908e-01   1.923   0.0605 .  
x2          -5.214e-05  1.945e-04  -0.268   0.7898    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.23 on 47 degrees of freedom
Multiple R-squared:  0.2824,    Adjusted R-squared:  0.2519 
F-statistic: 9.249 on 2 and 47 DF,  p-value: 0.0004101

A second question is how the inclusion of new variables that are linear combinations of these two correlated variables would affect the predictions. For example, below I regress on $x3$ and $x4$, where $x3 = x1 + x2$ and $x4 = x1 - x2$.

> d <- cbind(d, data.frame(x3 = d$x1 + d$x2, x4 = d$x1 - d$x2))
> m4 <- lm(y~x3+x4, data = d[1:50,])
> p7 <- predict(m4, d[1:50,])
> p8 <- predict(m4, d[51:100,])
> summary(p7-p3)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-1.623e-11 -9.898e-12 -4.107e-12  3.604e-13  8.413e-12  3.175e-11 
> summary(p8-p6)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-1.003e-11  3.312e-11  8.608e-11  1.299e-10  2.283e-10  3.933e-10 

Best Answer

This is a situation where mathematical abstraction is a huge help. In the following I will not introduce any new ideas, nor carry out any calculations, but will only exploit basic, simple definitions of linear algebra to present an effective way of thinking about collinearity.

The definitions needed to understand this are vector space, subspace, linear combination, linear transformation, kernel, and orthogonal projection.


One abstraction that works well is to view the columns of the $n\times p$ design matrix $X$ as elements of the vector space $\mathbb{R}^n.$ For any $p$-vector $\beta,$ interpret the expression

$$X\beta\tag{*}$$

as a linear combination of these columns. The set of all such possible linear combinations is a subspace $\mathbb{R}[X] \subset \mathbb{R}^n.$ (Its dimension is at most $p$ but could be smaller. We say $X$ is "collinear" when $\operatorname{dim}(\mathbb{R}[X])$ is strictly less than $p.$)
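
As a quick numerical check of this definition, here is a small R sketch reusing the data frame d built in the question (X12 and X123 are just illustrative names). The QR rank of the design matrix gives $\operatorname{dim}(\mathbb{R}[X]),$ and appending a column that is already a linear combination of the existing ones raises $p$ without raising that dimension:

X12  <- model.matrix(~ x1 + x2, data = d[1:50, ])              # columns: intercept, x1, x2
X123 <- model.matrix(~ x1 + x2 + I(x1 + x2), data = d[1:50, ]) # appends a redundant column
qr(X12)$rank    # 3, equal to ncol(X12): no exact collinearity
qr(X123)$rank   # still 3, while ncol(X123) is 4: X123 is collinear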

The expression $(*)$ defines a linear transformation, which to avoid abusing notation I will give a new name,

$$\mathcal{L}_X: \mathbb{R}^p \to \mathbb{R}[X] \subset \mathbb{R}^n;\quad \mathcal{L}_X(\beta) = X\beta.$$
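
In R, $\mathcal{L}_X$ is nothing more than a matrix product. A minimal sketch, assuming the fitted model m3 from the question is still in scope: applying $\mathcal{L}_X$ to the estimated coefficient vector reproduces the fitted values.

X    <- model.matrix(m3)        # the 50 x 3 design matrix of y ~ x1 + x2
bhat <- coef(m3)                # one particular p-vector beta
yhat <- drop(X %*% bhat)        # L_X(beta-hat): a linear combination of the columns of X
all.equal(yhat, fitted(m3), check.attributes = FALSE)   # should be TRUE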

Let's forget about this transformation for a moment in order to focus on the image of $\mathcal{L}_X$ in $\mathbb{R}^n.$ To do so, call this subspace $V = \mathbb{R}[X]$ to emphasize that it's just some definite subspace and to de-emphasize that $X$ was used to determine exactly which subspace $V$ is. The response vector $y$ can also be viewed as an element of $\mathbb{R}^n.$

I hope the following characterization requires no proof (because it merely restates well-known results):

Ordinary Least Squares regression seeks a $\hat y \in V$ of minimal distance to $y.$ The solution is unique and is found by projecting $y$ orthogonally onto $V.$

Let us call this projection $\operatorname{proj}:\mathbb{R}^n \to V,$ so that

$$\hat y = \operatorname{proj}_V(y).$$

Note that $\operatorname{proj}$ depends only on (1) the metric on $\mathbb{R}^n,$ up to a scalar multiple (this is given by some assumption about the covariances of the error terms in the model) and (2) the subspace $V.$
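
When $X$ has full column rank this projector can be written explicitly as the hat matrix $H = X(X^\top X)^{-1}X^\top.$ Here is a rough sketch of the claim that it depends only on $V$ (reusing the question's data, including the columns $x3 = x1 + x2$ and $x4 = x1 - x2$ constructed there, and computing $H$ from an orthonormal basis obtained by QR for numerical stability): the design matrices built from $x1, x2$ and from $x3, x4$ span the same subspace and therefore give the same projector and the same $\hat y.$

Xa <- model.matrix(~ x1 + x2, data = d[1:50, ])
Xb <- model.matrix(~ x3 + x4, data = d[1:50, ])   # a different basis for the same subspace V
Ha <- tcrossprod(qr.Q(qr(Xa)))                    # orthogonal projector onto V
Hb <- tcrossprod(qr.Q(qr(Xb)))
max(abs(Ha - Hb))                                 # tiny: the projector depends only on V
max(abs(Ha %*% d$y[1:50] - fitted(m3)))           # tiny: proj_V(y) is the OLS fit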

[Figure: sketch of the vector spaces and the transformations among them]

This figure sketches the vector spaces and transformations among them. The model fitting occurs in $\mathbb{R}^n,$ shown at the upper right, whereas how it is described depends on the objects at the lower left. The vectors $x_i\in\mathbb{R}^p$ are mapped to the corresponding columns of $X,$ which are elements of $V\subset \mathbb{R}^n,$ and $\hat\beta$ is mapped via $\mathcal{L}_X$ to $\hat y.$

The point of these definitional trivialities is to make it clear that this characterization depends on $X$ only through the subspace $V=\mathbb{R}[X]$ it determines. Some immediate consequences are:

  1. The solution does not depend on whether the columns of $X$ are orthogonal.

  2. Introducing additional columns into $X$ that are already elements of $V$ changes nothing.

  3. Parameter estimates $\hat\beta$ are obtained by pulling $\hat y$ back via $\mathcal{L}_X.$ Equivalently, all such $\hat\beta$ satisfy the equation $$\mathcal{L}_X(\hat\beta) = \operatorname{proj}_V(y) = \hat y.$$

  4. Any two such parameter estimates $\hat\beta_1, \hat\beta_2\in\mathbb{R}^p$ must therefore differ by an element of the kernel of $\mathcal{L}_X,$ because $$\mathcal{L}_X(\hat\beta_2 - \hat\beta_1) = \mathcal{L}_X(\hat\beta_2) - \mathcal{L}_X(\hat\beta_1) = \hat y - \hat y = 0 \in \mathbb{R}^n.$$

Point (3) shows that the prediction $\hat y$ does not depend on which $\hat\beta$ is used, and point (4) characterizes the extent to which $\hat\beta$ may fail to be uniquely determined. Of course, when $\operatorname{dim} V = p,$ $\operatorname{ker}(\mathcal{L}_X) = \{0\}$ and therefore $\hat\beta$ is uniquely determined.
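
Consequences (2) and (4) can be checked directly on the question's data. Below is a sketch (m5, X, b1 and b2 are new illustrative names; everything else comes from the question's code): adding the redundant column $x3 = x1 + x2$ leaves the fitted values unchanged, with lm reporting an NA (aliased) coefficient, and two different coefficient vectors producing the same fit differ by an element of the kernel of $\mathcal{L}_X.$

m5 <- lm(y ~ x1 + x2 + x3, data = d[1:50, ])      # the column x3 is already an element of V
coef(m5)                                          # the x3 coefficient is NA (aliased)
max(abs(fitted(m5) - fitted(m3)))                 # ~ 0: the predictions do not change

X  <- cbind(1, d$x1[1:50], d$x2[1:50], d$x3[1:50])          # rank-deficient design: p = 4, dim V = 3
b1 <- c(coef(m3), 0)                                        # all weight on x1 and x2
b2 <- c(coef(m3)[1], coef(m3)[2] - 1, coef(m3)[3] - 1, 1)   # shift one unit of weight onto x3 = x1 + x2
max(abs(X %*% (b1 - b2)))                                   # ~ 0: b1 - b2 lies in ker(L_X), so both give the same y-hat
max(abs(X %*% b1 - fitted(m3)))                             # ~ 0: that common y-hat is the OLS fit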
