Solved – the effect of having correlated predictors in a multiple regression model

linear model, multicollinearity, multiple regression, p-value, regression

I learned in my linear models class that if two predictors are correlated and both are included in a model, one will be insignificant. For example, assume the size of a house and the number of bedrooms are correlated. When predicting the cost of a house using these two predictors, one of them can be dropped because they are both providing a lot of the same information. Intuitively, this makes sense, but I have some more technical questions:

  1. How does this effect manifest itself in p-values of the regression coefficients when including only one or including both predictors in the model?
  2. How does the variance of the regression coefficients get affected by including both predictors in the model or just having one?
  3. How do I know which predictor the model will choose to be less significant?
  4. How does including only one or including both predictors change the value/variance of my forecasted cost?

Best Answer

The topic you are asking about is multicollinearity. You might want to read some of the threads on CV categorized under the multicollinearity tag. @whuber's answer there in particular is worth your time.


The assertion that "if two predictors are correlated and both are included in a model, one will be insignificant" is not correct. If a variable has a real effect, the probability that it will be significant depends on several things: the magnitude of the effect, the magnitude of the error variance, the variance of the variable itself, the amount of data you have, and the number of other variables in the model. Whether the variables are correlated is also relevant, but it does not override these facts. Consider the following simple demonstration in R:

library(MASS)    # allows you to generate correlated data
set.seed(4314)   # makes this example exactly replicable

# generate sets of 2 correlated variables w/ means=0 & SDs=1
X0 = mvrnorm(n=20,   mu=c(0,0), Sigma=rbind(c(1.00, 0.70),    # r=.70
                                            c(0.70, 1.00)) )
X1 = mvrnorm(n=100,  mu=c(0,0), Sigma=rbind(c(1.00, 0.87),    # r=.87
                                            c(0.87, 1.00)) )
X2 = mvrnorm(n=1000, mu=c(0,0), Sigma=rbind(c(1.00, 0.95),    # r=.95
                                            c(0.95, 1.00)) )
y0 = 5 + 0.6*X0[,1] + 0.4*X0[,2] + rnorm(20)    # y is a function of both
y1 = 5 + 0.6*X1[,1] + 0.4*X1[,2] + rnorm(100)   #  but is more strongly
y2 = 5 + 0.6*X2[,1] + 0.4*X2[,2] + rnorm(1000)  #  related to the 1st

# results of fitted models (skipping a lot of output, including the intercepts)
summary(lm(y0~X0[,1]+X0[,2]))
#             Estimate Std. Error t value Pr(>|t|)    
# X0[, 1]       0.6614     0.3612   1.831   0.0847 .     # neither variable
# X0[, 2]       0.4215     0.3217   1.310   0.2075       #  is significant
summary(lm(y1~X1[,1]+X1[,2]))
#             Estimate Std. Error t value Pr(>|t|)    
# X1[, 1]      0.57987    0.21074   2.752  0.00708 **    # only 1 variable
# X1[, 2]      0.25081    0.19806   1.266  0.20841       #  is significant
summary(lm(y2~X2[,1]+X2[,2]))
#             Estimate Std. Error t value Pr(>|t|)    
# X2[, 1]      0.60783    0.09841   6.177 9.52e-10 ***   # both variables
# X2[, 2]      0.39632    0.09781   4.052 5.47e-05 ***   #  are significant

The correlation between the two variables is lowest in the first example and highest in the third, yet neither variable is significant in the first example and both are in the last example. The magnitude of the effects is identical in all three cases, and the variances of the variables and the errors should be similar (they are stochastic, but drawn from populations with the same variance). The pattern we see here is due primarily to my manipulating the $N$s for each case.


The key concept for resolving your questions is the variance inflation factor (VIF). The VIF tells you how much larger the variance of a regression coefficient is than it would have been if the variable had been completely uncorrelated with all the other variables in the model. Note that the VIF is a multiplicative factor: if the variable in question is uncorrelated with the others, VIF = 1. A simple way to understand the VIF is as follows: you could fit a model predicting a variable (say, $X_1$) from all the other variables in your model (say, $X_2$), and get a multiple $R^2$. The VIF for $X_1$ would be $1/(1-R^2)$. If the VIF for $X_1$ were $10$ (often considered a threshold for excessive multicollinearity), then the variance of the sampling distribution of the regression coefficient for $X_1$ would be $10\times$ larger than it would have been if $X_1$ had been completely uncorrelated with all the other variables in the model.
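
To make that concrete, here is a minimal sketch of computing the VIF by hand in R, reusing the X2 data generated above (the vif() function in the car package would return the same number, if you prefer a packaged routine):

r2.x1  = summary(lm(X2[,1] ~ X2[,2]))$r.squared  # R^2 from regressing one predictor on the other
vif.x1 = 1/(1 - r2.x1)                           # VIF for X2[,1]
vif.x1
# with r=.95, R^2 is roughly .90, so the VIF comes out near 1/(1-.90) = 10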

Thinking about what would happen if you included both correlated variables vs. only one is similar, but slightly more complicated than the approach discussed above. This is because not including a variable means the model uses fewer degrees of freedom, which changes the residual variance and everything computed from it (including the variance of the regression coefficients). In addition, if the excluded variable really is associated with the response, the variance in the response due to that variable will be absorbed into the residual variance, making it larger than it otherwise would be. Thus, several things change simultaneously (whether the variable is correlated with another variable, and the residual variance), and the precise effect of dropping / including the other variable depends on how those trade off. The best way to think through this issue is in terms of the counterfactual of how the model would differ if the variables were uncorrelated rather than correlated, not in terms of including or excluding one of the variables.
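
As a quick illustration of those trade-offs (a sketch only, reusing y2 and X2 from above; the exact numbers depend on the seed), compare the model with both predictors to the model that drops the second one:

m.both = lm(y2 ~ X2[,1] + X2[,2])
m.one  = lm(y2 ~ X2[,1])
coef(summary(m.both))   # coefficient variances are inflated by the VIF (about 10 here)
coef(summary(m.one))    # the SE for X2[,1] shrinks (its VIF is now 1), but its
                        #  estimate absorbs part of X2[,2]'s effect
summary(m.both)$sigma   # residual SD with both predictors included
summary(m.one)$sigma    # a bit larger in this example: X2[,2]'s unique contribution
                        #  now sits in the residuals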


Armed with an understanding of the VIF, here are the answers to your questions:

  1. Because the variance of the sampling distribution of the regression coefficient would be larger (by a factor of the VIF) when it is correlated with other variables in the model, the p-values would be higher (i.e., less significant) than they otherwise would be.
  2. The variances of the regression coefficients would be larger, as already discussed.
  3. In general, this is hard to know without solving for the model. Typically, if only one of the two is significant, it will be the one with the stronger bivariate correlation with $Y$.
  4. How the predicted values and their variance would change is quite complicated. It depends on how strongly correlated the variables are and on the manner in which they appear to be associated with your response variable in your data (see the brief sketch after this list). Regarding this issue, it may help you to read my answer here: Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?
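
To give a feel for point 4, here is a rough sketch under the same simulated setup as above, comparing point predictions and prediction intervals from the two-predictor and one-predictor models at a hypothetical new observation with x1 = x2 = 1:

dat      = data.frame(y = y2, x1 = X2[,1], x2 = X2[,2])  # repackage the simulated data
fit.both = lm(y ~ x1 + x2, data=dat)
fit.one  = lm(y ~ x1,      data=dat)
newdat   = data.frame(x1=1, x2=1)                        # hypothetical new observation
predict(fit.both, newdata=newdat, interval="prediction")
predict(fit.one,  newdata=newdat, interval="prediction")
# the point predictions and interval widths differ somewhat; how much depends on
#  how correlated the predictors are and how each relates to the response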