I tried VIF on the Longley dataset to look for multicollinearity.
(I used the custom function posted at https://beckmw.wordpress.com/2013/02/05/collinearity-and-stepwise-vif-selection/comment-page-1/#comment-1788)
Case 1. Without VIF selection, the full model showed Population, GNP and GNP.deflator as not statistically significant, judging by their p-values.
lm(formula = Employed ~ ., data = longley)
Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925
Case 2. I ran the VIF selection using the function above. It removed GNP, GNP.deflator and Year, even though Year was highly significant in the full model (p-value 0.003037).
(If VIF is more than 10, multicollinearity is strongly suggested.)
require(fmsb)
VIF(lm(Employed~., data=longley))
VIF is 221 using fmsb package.
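For comparison, here is a minimal sketch of per-predictor VIFs computed by hand, using VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The name `manual_vif` is made up for illustration. As far as I know, fmsb's `VIF()` just returns 1 / (1 - R^2) for whatever model it is given, so the single figure of 221 above would describe the regression of Employed on everything (consistent with the R-squared of 0.9955) rather than any individual predictor.

```r
# Per-predictor VIF computed by hand on the built-in longley data.
# VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from
# regressing predictor j on all the remaining predictors.
predictors <- longley[, names(longley) != "Employed"]

manual_vif <- sapply(names(predictors), function(v) {
  # Build the formula  v ~ (all other predictors)  and fit it
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v),
                   data = predictors))$r.squared
  1 / (1 - r2)
})

# Largest VIFs first
round(sort(manual_vif, decreasing = TRUE), 1)
```

Note that the response Employed never enters this calculation, so a predictor can have a very large VIF and still be highly significant in the full model.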
keep.dat <- vif_func(in_frame=longley[,-7],thresh=5,trace=T)
form.in<-paste('Employed ~',paste(keep.dat,collapse='+'))
form.in
fit<-lm(form.in,data=longley)
summary(fit)
Multiple R-squared: 0.9696, Adjusted R-squared: 0.962 (using usdm pkg)
Multiple R-squared: 0.9696, Adjusted R-squared: 0.962 (using fmsb pkg)
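To make the selection step less of a black box, here is a rough base-R sketch of the general idea behind stepwise VIF selection: compute every predictor's VIF, drop the single worst one if it exceeds the threshold, then recompute and repeat. `vif_all` and `stepwise_vif_sketch` are hypothetical helpers written for this post. They are not the blog's `vif_func`, and they may not reproduce its output exactly.

```r
# Hypothetical helper: VIF of every column of `dat`, each regressed
# on all the other columns (the response is not part of `dat`).
vif_all <- function(dat) {
  sapply(names(dat), function(v) {
    others <- setdiff(names(dat), v)
    r2 <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
    1 / (1 - r2)
  })
}

# Hypothetical helper: drop the predictor with the largest VIF, one at
# a time, until all VIFs are below `thresh` or only two predictors remain.
stepwise_vif_sketch <- function(dat, thresh = 5) {
  while (ncol(dat) > 2) {
    vifs <- vif_all(dat)
    if (max(vifs) < thresh) break
    worst <- names(which.max(vifs))
    message("dropping ", worst, " (VIF = ", round(max(vifs), 1), ")")
    dat <- dat[, names(dat) != worst, drop = FALSE]
  }
  names(dat)
}

# Same input as above: all columns except the response Employed
kept <- stepwise_vif_sketch(longley[, names(longley) != "Employed"], thresh = 5)
kept
```

Two things stand out in a procedure like this. It never looks at Employed or at any p-value, so significance in the full model plays no role in what gets dropped. And because the VIFs are recomputed after each removal, several members of a tightly correlated group can be removed one after another before the remaining set falls under the threshold.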
Questions:
- Why did the VIF selection remove Year, given that it was highly significant before applying VIF?
Before VIF:
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 **
## GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141
## GNP          -3.582e-02  3.349e-02  -1.070 0.312681
## Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 **
## Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
## Population   -5.110e-02  2.261e-01  -0.226 0.826212
## Year          1.829e+00  4.555e-01   4.016 0.003037 **
After VIF:
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.323091   4.211566  -0.314  0.75880
Unemployed   -0.012292   0.003354  -3.665  0.00324 **
Armed.Forces -0.001893   0.003516  -0.538  0.60019
Population    0.605146   0.047617  12.709 2.55e-08 ***
- When there is multicollinearity between two predictors, should we not remove one and retain the other? Here it seems to be removing both variables.
Best Answer
First, instead of automatically removing variables using vif or any function, you should use collinearity indexes and proportion of variance explained to get a better understanding of what is going on. In R these are available in the colldiag function in the perturb package.
Second, when you have collinearity, there are a number of possible remedies. E.g.