I have read in many places that collinearity doesn't affect the predictions; it only affects the coefficient tests and confidence intervals. As a result, a collinear model cannot be used for causal inference, but it is 'safe' to use for making predictions. Here is a proof for the case when the predictors are perfectly correlated. But why is this also the case for imperfectly correlated predictors?
Let's say we have two predictors with correlation $0.8$: how does including both of them in a multiple regression affect the predictions?
Below is an example with about $0.92$ correlation between two predictors, and it looks like there is some difference between the predictions, which is especially large on the test data. Is it possible to give a formal analysis (like the one for perfect correlation in the link above)?
> x1 <- 1:100 + rnorm(100)
> x2 <- x1^3 - rnorm(100, 50, 50)
> cor(x1, x2)
[1] 0.9179052
> y <- 201:300 + rnorm(100, 0, 20)
> cor(y, x1)
[1] 0.8463703
> cor(y, x2)
[1] 0.7621107
> d <- data.frame(x1= x1, x2= x2, y= y)
> m <- lm(y~x1, data = d[1:50,])
> m2 <- lm(y~x2, data = d[1:50,])
> m3 <- lm(y~x1+x2, data = d[1:50,])
> #on training data
> p1 <- predict(m, d[1:50,])
> p2 <- predict(m2, d[1:50,])
> p3 <- predict(m3, d[1:50,])
> summary(p1 - p2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-10.759 -3.269 1.320 0.000 3.872 5.166
> summary(p1 - p3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.8312 -0.6590 -0.2200 0.0000 0.5984 1.5835
> summary(p2 - p3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.979 -4.550 -1.755 0.000 3.877 12.052
> #on test data
> p4 <- predict(m, d[51:100,])
> p5 <- predict(m2, d[51:100,])
> p6 <- predict(m3, d[51:100,])
> summary(p4 - p5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-232.308 -137.667 -76.297 -90.126 -33.208 -7.692
> summary(p5 - p6)
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.709 40.101 91.202 107.497 163.833 275.705
> summary(p4 - p6)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.016 6.893 14.905 17.371 26.166 43.397
> #model summaries
> summary(m)
Call:
lm(formula = y ~ x1, data = d[1:50, ])
Residuals:
Min 1Q Median 3Q Max
-45.424 -10.758 0.179 13.969 31.637
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 204.0189 5.5249 36.927 < 2e-16 ***
x1 0.8227 0.1898 4.335 7.43e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.04 on 48 degrees of freedom
Multiple R-squared: 0.2813, Adjusted R-squared: 0.2664
F-statistic: 18.79 on 1 and 48 DF, p-value: 7.435e-05
> summary(m2)
Call:
lm(formula = y ~ x2, data = d[1:50, ])
Residuals:
Min 1Q Median 3Q Max
-44.193 -14.038 0.306 15.814 35.632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.157e+02 3.736e+00 57.725 < 2e-16 ***
x2 2.923e-04 7.808e-05 3.743 0.000486 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.76 on 48 degrees of freedom
Multiple R-squared: 0.226, Adjusted R-squared: 0.2098
F-statistic: 14.01 on 1 and 48 DF, p-value: 0.0004857
> summary(m3)
Call:
lm(formula = y ~ x1 + x2, data = d[1:50, ])
Residuals:
Min 1Q Median 3Q Max
-45.848 -11.305 0.024 13.404 32.701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.026e+02 7.700e+00 26.310 <2e-16 ***
x1 9.439e-01 4.908e-01 1.923 0.0605 .
x2 -5.214e-05 1.945e-04 -0.268 0.7898
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.23 on 47 degrees of freedom
Multiple R-squared: 0.2824, Adjusted R-squared: 0.2519
F-statistic: 9.249 on 2 and 47 DF, p-value: 0.0004101
My second question is: how would including new variables that are linear combinations of these two correlated variables affect the predictions? Let's say we regress on $x3$ and $x4$, where $x3 = x1 + x2$ and $x4 = x1 - x2$.
> d <- cbind(d, data.frame(x3 = d$x1 + d$x2, x4 = d$x1 - d$x2))
> m4 <- lm(y~x3+x4, data = d[1:50,])
> p7 <- predict(m4, d[1:50,])
> p8 <- predict(m4, d[51:100,])
> summary(p7-p3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.623e-11 -9.898e-12 -4.107e-12 3.604e-13 8.413e-12 3.175e-11
> summary(p8-p6)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.003e-11 3.312e-11 8.608e-11 1.299e-10 2.283e-10 3.933e-10
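The differences are of the order $10^{-11}$ against predictions of the order $10^{2}$, so the two fits appear to be numerically identical; a quick check:
all.equal(p7, p3)   # differences of order 1e-11 are far below the default tolerance, so this should be TRUE
all.equal(p8, p6)   # likewise for the test-data predictions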
Best Answer
This is a situation where mathematical abstraction is a huge help. In the following I will not introduce any new ideas, nor carry out any calculations, but will only exploit basic, simple definitions of linear algebra to present an effective way of thinking about collinearity.
The definitions needed to understand this are vector space, subspace, linear combination, linear transformation, kernel, and orthogonal projection.
One abstraction that works well is to view the columns of the $n\times p$ design matrix $X$ as elements of the vector space $\mathbb{R}^n.$ For any $p$-vector $\beta,$ interpret the expression
$$X\beta\tag{*}$$
as a linear combination of these columns. The set of all such possible linear combinations is a subspace $\mathbb{R}[X] \subset \mathbb{R}^n.$ (Its dimension is at most $p$ but could be smaller. We say $X$ is "collinear" when $\operatorname{dim}(\mathbb{R}[X])$ is strictly less than $p.$)
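Although the argument needs no computation, this definition is easy to check numerically by comparing the rank of $X$ with its number of columns. A tiny made-up example (not the data from the question):
# a 5 x 3 design matrix whose third column equals the sum of the first two,
# so its columns span only a 2-dimensional subspace of R^5
X <- cbind(intercept = 1, a = 1:5, b = 2:6)
qr(X)$rank   # 2
ncol(X)      # 3: the rank is strictly less than p, so this X is collinear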
The expression $(*)$ defines a linear transformation, which to avoid abusing notation I will give a new name,
$$\mathcal{L}_X: \mathbb{R}^p \to \mathbb{R}[X] \subset \mathbb{R}^n;\quad \mathcal{L}_X(\beta) = X\beta.$$
Let's forget about this transformation for a moment in order to focus on the image of $\mathcal{L}_X$ in $\mathbb{R}^n.$ To do so, call this subspace $V = \mathbb{R}[X]$ to emphasize that it is just some definite subspace and to de-emphasize that $X$ was used to determine exactly which subspace $V$ is. The response vector $y$ can also be viewed as an element of $\mathbb{R}^n.$
I hope the following characterization requires no proof (because it merely restates well-known results): the least-squares fit $\hat y$ is the orthogonal projection of $y$ onto the subspace $V.$
Let us call this projection $\operatorname{proj}_V:\mathbb{R}^n \to V,$ so that
$$\hat y = \operatorname{proj}_V(y).$$
Note that $\operatorname{proj}$ depends only on (1) the metric on $\mathbb{R}^n,$ up to a scalar multiple (this is given by some assumption about the covariances of the error terms in the model) and (2) the subspace $V.$
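In R terms this characterization says that the fitted values returned by lm coincide with the orthogonal projection of $y$ onto the column space of the design matrix. A small simulated sketch with made-up variables z1, z2, z (not the question's data; note that the strong correlation between the predictors plays no role, which is the point):
set.seed(1)
n  <- 30
z1 <- rnorm(n)
z2 <- 0.9 * z1 + 0.1 * rnorm(n)        # strongly correlated with z1
z  <- 2 + z1 + rnorm(n)                # response
Z  <- cbind(1, z1, z2)                 # design matrix; its columns span V
P  <- Z %*% solve(crossprod(Z), t(Z))  # orthogonal projection onto V = R[Z]
all.equal(as.numeric(P %*% z), unname(fitted(lm(z ~ z1 + z2))))  # should be TRUE: the fit is proj_V(z)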
This figure sketches the vector spaces and transformations among them. The model fitting occurs in $\mathbb{R}^n$ shown at the upper right, whereas how it is described depends on the objects at the lower left. The vectors $x_i\in\mathbb{R}^p$ are mapped to the corresponding columns of $X,$ which are elements of $V\subset \mathbb{R}^n.$ $\hat\beta$ is mapped via $\mathcal{L}_X$ to $\hat y.$
The point of these definitional trivialities is to make it clear that this characterization depends on $X$ only through the subspace $V=\mathbb{R}[X]$ it determines. Some immediate consequences are:
1. The solution does not depend on whether the columns of $X$ are orthogonal.
2. Introducing additional columns into $X$ that are already elements of $V$ changes nothing.
3. Parameter estimates $\hat\beta$ are obtained by pulling $\hat y$ back via $\mathcal{L}_X.$ Equivalently, all such $\hat\beta$ satisfy the equation $$\mathcal{L}_X(\hat\beta) = \operatorname{proj}_V(y) = \hat y.$$
4. Any two such parameter estimates $\hat\beta_1, \hat\beta_2\in\mathbb{R}^p$ must therefore differ by an element of the kernel of $\mathcal{L}_X,$ because $$\mathcal{L}_X(\hat\beta_2 - \hat\beta_1) = \mathcal{L}_X(\hat\beta_2) - \mathcal{L}_X(\hat\beta_1) = \hat y - \hat y = 0 \in \mathbb{R}^n.$$
Point (3) shows the prediction $\hat y$ does not depend on $\hat\beta,$ and point (4) characterizes the extent to which $\hat\beta$ may fail to be uniquely determined. Of course, when $\operatorname{dim} V = p,$ $\operatorname{ker}(\mathcal{L}_X) = 0$ and therefore $\hat\beta$ is uniquely determined.
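Continuing the small simulated sketch above, points (2) and (4) can be seen directly: adding a column that already lies in $V$ leaves $\hat y$ unchanged, and lm signals the resulting non-uniqueness of $\hat\beta$ by reporting an aliased (NA) coefficient.
z3 <- z1 + z2                          # a new column that is already an element of V
f       <- lm(z ~ z1 + z2)
f_extra <- lm(z ~ z1 + z2 + z3)        # same subspace V, hence the same projection
all.equal(fitted(f), fitted(f_extra))  # should be TRUE: the fitted values are unchanged (point 2)
coef(f_extra)                          # the z3 coefficient is NA: the estimate is determined only
                                       # up to an element of ker(L_X) (point 4)
This is the same phenomenon as the comparison of m3 with m4 in the question: the pairs (x1, x2) and (x3, x4) = (x1 + x2, x1 - x2) span the same subspace $V,$ so the fitted values agree to machine precision, while the coefficient estimates differ.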