The topic you are asking about is multicollinearity. You might want to read some of the threads on CV categorized under the multicollinearity tag. @whuber's answer linked above in particular is also worth your time.
The assertion that "if two predictors are correlated and both are included in a model, one will be insignificant" is not correct. If there is a real effect of a variable, the probability that the variable will be significant is a function of several things: the magnitude of the effect, the magnitude of the error variance, the variance of the variable itself, the amount of data you have, and the number of other variables in the model. Whether the variables are correlated is also relevant, but it doesn't override these facts. Consider the following simple demonstration in R:
library(MASS) # allows you to generate correlated data
set.seed(4314) # makes this example exactly replicable
# generate sets of 2 correlated variables w/ means=0 & SDs=1
X0 = mvrnorm(n=20, mu=c(0,0), Sigma=rbind(c(1.00, 0.70), # r=.70
c(0.70, 1.00)) )
X1 = mvrnorm(n=100, mu=c(0,0), Sigma=rbind(c(1.00, 0.87), # r=.87
c(0.87, 1.00)) )
X2 = mvrnorm(n=1000, mu=c(0,0), Sigma=rbind(c(1.00, 0.95), # r=.95
c(0.95, 1.00)) )
y0 = 5 + 0.6*X0[,1] + 0.4*X0[,2] + rnorm(20) # y is a function of both
y1 = 5 + 0.6*X1[,1] + 0.4*X1[,2] + rnorm(100) # but is more strongly
y2 = 5 + 0.6*X2[,1] + 0.4*X2[,2] + rnorm(1000) # related to the 1st
# results of fitted models (skipping a lot of output, including the intercepts)
summary(lm(y0~X0[,1]+X0[,2]))
# Estimate Std. Error t value Pr(>|t|)
# X0[, 1] 0.6614 0.3612 1.831 0.0847 . # neither variable
# X0[, 2] 0.4215 0.3217 1.310 0.2075 # is significant
summary(lm(y1~X1[,1]+X1[,2]))
# Estimate Std. Error t value Pr(>|t|)
# X1[, 1] 0.57987 0.21074 2.752 0.00708 ** # only 1 variable
# X1[, 2] 0.25081 0.19806 1.266 0.20841 # is significant
summary(lm(y2~X2[,1]+X2[,2]))
# Estimate Std. Error t value Pr(>|t|)
# X2[, 1] 0.60783 0.09841 6.177 9.52e-10 *** # both variables
# X2[, 2] 0.39632 0.09781 4.052 5.47e-05 *** # are significant
The correlation between the two variables is lowest in the first example and highest in the third, yet neither variable is significant in the first example and both are in the last example. The magnitude of the effects is identical in all three cases, and the variances of the variables and the errors should be similar (they are stochastic, but drawn from populations with the same variance). The pattern we see here is due primarily to my manipulating the $N$s for each case.
The key concept to understand to resolve your questions is the variance inflation factor (VIF). The VIF tells you how much larger the variance of your regression coefficient is than it would have been had the variable been completely uncorrelated with all the other variables in the model. Note that the VIF is a multiplicative factor: if the variable in question is uncorrelated with the others, VIF = 1. A simple way to understand the VIF is as follows: you could fit a model predicting a variable (say, $X_1$) from all the other variables in your model (say, $X_2$), and get a multiple $R^2$. The VIF for $X_1$ would be $1/(1-R^2)$. Say the VIF for $X_1$ were $10$ (often considered a threshold for excessive multicollinearity); then the variance of the sampling distribution of the regression coefficient for $X_1$ would be $10\times$ larger than it would have been if $X_1$ had been completely uncorrelated with all the other variables in the model.
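The $1/(1-R^2)$ formula is easy to verify by hand. Here is a minimal sketch (the variable names `x1`, `x2`, `x3` and the simulated data are illustrative, not from the example above):

```r
# Computing the VIF for one predictor by hand: regress that predictor
# on all the other predictors, take the multiple R^2, invert 1 - R^2.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + sqrt(1 - 0.8^2) * rnorm(n)  # correlated with x1 (r ~ .8)
x3 <- rnorm(n)                               # uncorrelated with x1

r2  <- summary(lm(x1 ~ x2 + x3))$r.squared   # predict x1 from the others
vif <- 1 / (1 - r2)
vif  # roughly 1/(1 - .8^2), i.e., somewhere near 2.8
```

With a true correlation of .8 between `x1` and `x2`, the sampling variance of `x1`'s coefficient in a model containing both is inflated by roughly this factor relative to the uncorrelated case.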
Thinking about what would happen if you included both correlated variables vs. only one is similar, but slightly more complicated than the approach discussed above. This is because not including a variable means the model uses fewer degrees of freedom, which changes the residual variance and everything computed from it (including the variance of the regression coefficients). In addition, if the non-included variable really is associated with the response, the variance in the response due to that variable will be absorbed into the residual variance, making it larger than it otherwise would be. Thus, several things change simultaneously (whether the variable is correlated with another variable, and the residual variance), and the precise effect of dropping / including the other variable depends on how those trade off. The best way to think through this issue is based on the counterfactual of how the model would differ if the variables were uncorrelated instead of correlated, rather than including or excluding one of the variables.
Armed with an understanding of the VIF, here are the answers to your questions:
- Because the variance of the sampling distribution of the regression coefficient would be larger (by a factor of the VIF) if it were correlated with other variables in the model, the p-values would be higher (i.e., less significant) than they otherwise would be.
- The variances of the regression coefficients would be larger, as already discussed.
- In general, this is hard to know without solving for the model. Typically, if only one of two correlated variables is significant, it will be the one with the stronger bivariate correlation with $Y$.
- How the predicted values and their variance would change is quite complicated. It depends on how strongly correlated the variables are and the manner in which they appear to be associated with your response variable in your data. Regarding this issue, it may help you to read my answer here: Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?
Variable selection based on "significance", AIC, BIC, or Cp is not a valid approach in this context. Lasso (L1) shrinkage works but you may be disappointed in the stability of the list of "important" predictors found by lasso.
The simplest approach to understanding co-linearity is variable clustering and redundancy analysis (e.g., the varclus and redun functions in the R Hmisc package). This approach is not tailored to the actual model you use: logistic regression uses weighted $X'X$ calculations instead of the regular $X'X$ used in variable clustering and redundancy analysis, but it will be close. To tailor the co-linearity assessment to the actual chosen outcome model, you can compute the correlation matrix of the maximum likelihood estimates of $\beta$ and even use that matrix as a similarity matrix in a hierarchical cluster analysis, not unlike what varclus does.
Various data reduction procedures, the oldest one being incomplete principal components regression, can avoid co-linearity problems at some expense of interpretability. In general, data reduction performs better than all stepwise variable selection algorithms because of the direct way that data reduction handles co-linearity.
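As a hedged sketch of the oldest such procedure, incomplete principal components regression replaces a block of correlated predictors with their first $k$ principal components and regresses the outcome on those. Everything below (the choice of $k$, the scaling, the simulated data) is my own illustration, not something prescribed by the answer:

```r
# Incomplete principal components regression: three predictors are
# noisy copies of one underlying signal, so one component carries
# nearly all the information and the collinearity disappears.
set.seed(7)
n <- 300
z <- rnorm(n)                               # shared underlying signal
X <- cbind(z + rnorm(n, sd = 0.3),
           z + rnorm(n, sd = 0.3),
           z + rnorm(n, sd = 0.3))          # three collinear predictors
y <- 2 + z + rnorm(n)

pc  <- prcomp(X, scale. = TRUE)             # principal components of X
k   <- 1                                    # keep only the first component
fit <- lm(y ~ pc$x[, 1:k])                  # regress y on the retained PCs
summary(fit)$coefficients
```

The retained component is orthogonal to any others you might keep, so the usual inflation of coefficient variances cannot occur; the cost, as noted above, is that the coefficient is on a component rather than on an interpretable original variable.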
You can get VIFs in logistic regression. See, for example, the vif function that can be applied to lrm fits in the R rms package.
Let's predict income with two highly positively correlated variables: Years of work experience and number of carrots eaten in one's lifetime. Let's ignore omitted variable bias issues. Also, let's say years of work experience has a much greater impact on income than carrots eaten.
Your beta parameter estimates would be unbiased, but the standard errors of the parameter estimates would be greater than if the predictors were not correlated. Collinearity does not violate any assumptions of GLMs (unless there is perfect collinearity).
Collinearity is fundamentally a data problem. In small datasets, you might not have enough data to estimate beta coefficients. In large datasets, you likely will. Either way, you can interpret the beta parameters and the standard errors just as if collinearity were not an issue. Just be aware that some of your parameter estimates might not be significant.
In the event your parameter estimates are not significant, get more data. Dropping a variable that should be in your model ensures your estimates are biased. For example, if you were to drop the years of experience variable, the coefficient on carrots eaten would become positively biased by "absorbing" the impact of the dropped variable.
To answer the shared variance question, here is a fun test you can do in a statistical program of your choice:
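The original code for this test did not survive in this copy of the answer, but the setup described in the next paragraph is easy to reconstruct: make x1 and x2 share most of their variance, then generate y from x1 alone (this is my own reconstruction, with my own variable names and numbers):

```r
# x1 and x2 are highly correlated (r ~ .95), but y depends only on x1.
set.seed(123)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)  # huge shared variance
y  <- 3 + 2 * x1 + rnorm(n)                    # x2 plays no role in y

summary(lm(y ~ x1 + x2))$coefficients
```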
Although there is a very large shared variance between x1 and x2, only x1 has a ceteris paribus (marginal) effect on y. In contrast, holding x1 constant and changing x2 does nothing to the expected value of y, so the shared variance is irrelevant.