Solved – VIF to find multicollinearity

multicollinearity, variance-inflation-factor

I tried using VIF on the Longley dataset to look for multicollinearity.
(I used the custom function posted at https://beckmw.wordpress.com/2013/02/05/collinearity-and-stepwise-vif-selection/comment-page-1/#comment-1788)

Case 1. Without VIF selection, the full model shows Population, GNP and GNP.deflator as not statistically significant, judging by their p-values.

lm(formula = Employed ~ ., data = longley)

Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925

Case 2. I applied stepwise VIF selection using the above function. It removed GNP, GNP.deflator and Year, even though Year was highly significant in the full model (p-value 0.003037).

(If VIF is more than 10, multicollinearity is strongly suggested.)

require(fmsb)

VIF(lm(Employed~., data=longley)) 

VIF is 221 using fmsb package.
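A note on the 221: as far as I can tell, fmsb's VIF() returns 1/(1 - R^2) of whatever model is passed to it, so here it describes the regression of Employed on everything else rather than any single predictor. The usual per-predictor VIF regresses each predictor on the remaining predictors. A minimal base-R sketch of both quantities (the names predictors, vif and full are only illustrative):

```r
# Per-predictor VIF: regress each predictor on all the other predictors
predictors <- setdiff(names(longley), "Employed")

vif <- sapply(predictors, function(v) {
  aux <- lm(reformulate(setdiff(predictors, v), response = v), data = longley)
  1 / (1 - summary(aux)$r.squared)
})

# Largest VIFs first
round(sort(vif, decreasing = TRUE), 1)

# For comparison: 1 / (1 - R^2) of the full model for Employed
full <- lm(Employed ~ ., data = longley)
1 / (1 - summary(full)$r.squared)
```

The per-predictor values are the ones the "VIF more than 10" rule of thumb is normally applied to.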

keep.dat <- vif_func(in_frame=longley[,-7],thresh=5,trace=T)

form.in<-paste('Employed ~',paste(keep.dat,collapse='+'))

form.in

fit<-lm(form.in,data=longley)

summary(fit)

Multiple R-squared:  0.9696, Adjusted R-squared:  0.962 (using usdm pkg)

Multiple R-squared:  0.9696, Adjusted R-squared:  0.962 (using fmsb pkg)
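For readers without the linked function, the general idea of stepwise VIF selection can be sketched in base R as below. This is not the linked vif_func itself, only an assumed reading of the procedure: compute every predictor's VIF, drop the single largest, and repeat until all are below the threshold. The names vif_all and stepwise_vif are hypothetical.

```r
# VIF of every column in a data frame of predictors: 1 / (1 - R^2_j)
vif_all <- function(df) {
  sapply(names(df), function(v) {
    aux <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Drop the predictor with the largest VIF until all VIFs fall below thresh
stepwise_vif <- function(df, thresh = 5) {
  repeat {
    v <- vif_all(df)
    # Stop when every VIF is acceptable, or when only two predictors remain
    if (max(v) < thresh || ncol(df) <= 2) break
    worst <- names(which.max(v))
    message("dropping ", worst, " (VIF = ", round(max(v), 1), ")")
    df <- df[, names(df) != worst, drop = FALSE]
  }
  names(df)
}

predictors <- longley[, names(longley) != "Employed"]
kept <- stepwise_vif(predictors, thresh = 5)
kept
```

Note that the procedure never looks at Employed or at any p-value: a predictor is dropped purely because the other predictors explain it well, and the VIFs are recomputed after every removal, so several members of a correlated group can be removed one after another.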

Questions:

  1. Why did the stepwise VIF procedure remove Year, given that it was highly significant in the full model?

Before VIF:

## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 ** 
## GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141    
## GNP          -3.582e-02  3.349e-02  -1.070 0.312681    
## Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 ** 
## Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
## Population   -5.110e-02  2.261e-01  -0.226 0.826212    
## Year          1.829e+00  4.555e-01   4.016 0.003037 **

After VIF:

 Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
 (Intercept)  -1.323091   4.211566  -0.314  0.75880    
 Unemployed   -0.012292   0.003354  -3.665  0.00324 ** 
 Armed.Forces -0.001893   0.003516  -0.538  0.60019    
 Population    0.605146   0.047617  12.709 2.55e-08 ***

  2. When there is multicollinearity between two predictors, should we not remove one and retain the other? Here it seems to have removed both variables.


Best Answer

First, instead of automatically removing variables using VIF or any other function, you should look at condition indexes and variance decomposition proportions to get a better understanding of what is going on. In R these are available from the colldiag function in the perturb package.
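If the perturb package is not at hand, the diagnostics that colldiag reports (condition indexes and variance decomposition proportions) can be approximated in base R. The sketch below assumes the usual Belsley-style recipe: include the intercept, scale each column of the design matrix to unit length without centering, and take the singular value decomposition. The variable names are illustrative, and the numbers may differ slightly from colldiag depending on its scaling options.

```r
# Design matrix including the intercept column
X <- model.matrix(Employed ~ ., data = longley)

# Scale every column to unit length (no centering)
Xs <- sweep(X, 2, sqrt(colSums(X^2)), "/")

sv <- svd(Xs)

# Condition indexes: largest singular value divided by each singular value
cond_index <- max(sv$d) / sv$d

# phi[k, j] = v[k, j]^2 / d[j]^2 (rows: variables, columns: components)
phi <- sweep(sv$v^2, 2, sv$d^2, "/")

# Variance decomposition proportions: rows are components, columns are
# coefficients, and each column sums to 1
vdp <- t(phi / rowSums(phi))
colnames(vdp) <- colnames(X)

round(cbind(cond_index, vdp), 3)
```

A common reading is to look at rows with a large condition index (30 is an often-quoted cutoff) and see which coefficients have a large share of their variance in that row; those variables are the ones involved in the same near-dependency.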

Second, when you have collinearity, there are a number of possible remedies, for example:

  • Using a penalized method like ridge regression
  • Getting more data
  • Removing variables
  • Doing partial least squares regression
  • Doing principal components regression
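
As an illustration of the first remedy in the list, here is a minimal ridge regression sketch in base R using the closed-form solution on standardized predictors. The penalty lambda = 1 is an arbitrary value chosen for illustration, and ridge_coef is a hypothetical helper; in practice the penalty would be chosen by something like cross-validation, for example with MASS::lm.ridge or the glmnet package.

```r
# Standardized predictors and centered response (so no intercept is needed)
X <- scale(as.matrix(longley[, names(longley) != "Employed"]))
y <- longley$Employed - mean(longley$Employed)

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

b_ols   <- ridge_coef(0)  # lambda = 0 reproduces ordinary least squares
b_ridge <- ridge_coef(1)  # lambda = 1 is an assumed, illustrative penalty

# Standardized coefficients side by side
round(cbind(OLS = b_ols, ridge = b_ridge), 3)
```

The penalty shrinks the coefficient vector as a whole, which stabilizes the estimates for the correlated predictors while keeping every variable in the model.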