High correlation between two variables but VIFs do not indicate collinearity

Tags: correlation, multicollinearity, variance-inflation-factor

What would you go for in assessing collinearity – correlation or VIFs?

Say you run pairplots and calculate Pearson correlation coefficients between pairs of explanatory variables. Two of them have a correlation coefficient of around 0.8, which is rather high and suggests that including both in the same regression model might not be a good idea. But say you include them anyway and then run the vif function from the car package in R. The Variance Inflation Factors (VIFs) are all below a chosen threshold (say, 3), which would indicate no problematic collinearity.

Would you say that:

1) Both can be included in the model if VIFs are low, regardless of high correlation?

or

2) High correlation means one variable should be dropped immediately at the beginning, regardless of VIFs?

(This is not related to any particular data or model, hence no specific data example provided, but I have experienced this in the past. I know there have been threads related to this question, but I have not found an actual answer to my specific question.)
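For scale, the two diagnostics in the scenario above are tied together by the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. With exactly two predictors, R_j^2 is the squared pairwise correlation, so r = 0.8 gives a VIF of about 2.78, just under a cutoff of 3. With more predictors, the pairwise value is only a lower bound on the VIF. The following is a minimal sketch of that arithmetic in plain Python; the function names are made up for illustration and are not part of any package.

```python
import math

def vif_from_pairwise_r(r):
    """VIF implied by a pairwise correlation r when there are only two predictors.

    With more predictors this is a lower bound, because R_j^2 cannot decrease
    when further regressors are added.
    """
    return 1.0 / (1.0 - r ** 2)

def pairwise_r_for_vif(vif):
    """Absolute pairwise correlation at which a two-predictor VIF reaches `vif`."""
    return math.sqrt(1.0 - 1.0 / vif)

for r in (0.5, 0.8, 0.9, 0.95):
    print(f"r = {r:.2f} -> VIF = {vif_from_pairwise_r(r):.2f}")

# Correlation needed before a two-predictor VIF crosses the cutoff of 3
print(f"VIF = 3 corresponds to |r| = {pairwise_r_for_vif(3.0):.3f}")
```

So a correlation of 0.8 and a VIF below 3 are not in conflict; they are the same information on two different scales, and the apparent disagreement comes from the choice of cutoffs.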

Best Answer

I would use condition indexes rather than either VIFs or correlations; I wrote my dissertation about this, but you can also see the work of David Belsley, e.g. this book. But if I had to choose between VIFs and correlations, I'd go with VIFs. Belsley shows that fairly high correlations are not always problematic.
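To make the idea of condition indexes concrete, here is a rough sketch in plain Python. It assumes the usual description of Belsley's diagnostic: keep the intercept column, scale every column of the design matrix to unit length without centering, and take the ratio of the largest singular value to each singular value. The toy data, the function names, and the small Jacobi eigenvalue routine are all illustrative assumptions; in practice a linear algebra library's SVD would do this job.

```python
import math

def unit_length_columns(columns):
    """Scale each column to Euclidean length 1 (no centering)."""
    scaled = []
    for col in columns:
        norm = math.sqrt(sum(v * v for v in col))
        scaled.append([v / norm for v in col])
    return scaled

def gram_matrix(columns):
    """X'X for a design matrix stored as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]

def symmetric_eigenvalues(matrix, sweeps=100, tol=1e-22):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                d = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry
                theta = math.pi / 4 if d == 0.0 else 0.5 * math.atan(2 * a[p][q] / d)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]

def condition_indexes(columns):
    """Condition indexes sqrt(lambda_max / lambda_k) of the scaled design matrix."""
    eigs = symmetric_eigenvalues(gram_matrix(unit_length_columns(columns)))
    largest = max(eigs)
    return sorted(math.sqrt(largest / lam) for lam in eigs)

# Toy design matrix (made-up data): intercept plus two strongly correlated predictors
ones = [1.0] * 10
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0]

indexes = condition_indexes([ones, x1, x2])
print([round(v, 2) for v in indexes])
```

The smallest index is always 1, and larger values point to near-dependencies among the columns. A commonly cited rough guide treats values around 30 or more as worth a closer look, usually together with variance-decomposition proportions, which this sketch does not compute.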

If you are using R, another method that seems good to me is to use the perturb package to see if the collinearity is problematic.
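The general idea behind a perturbation check can be sketched without any package: add a small amount of random noise to the predictors, refit the model many times, and see how much the coefficients move. The example below is a hand-rolled illustration in plain Python and does not reproduce the perturb package's interface or defaults. The data, the noise level `sd`, the number of refits, and the helper `fit_two_predictors` are all assumptions made for the example.

```python
import random
import statistics

def fit_two_predictors(x1, x2, y):
    """OLS slopes for y ~ x1 + x2 (with intercept), from the centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Made-up data: two strongly correlated predictors and a response
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1]
y = [1.0 + 2.0 * a + 0.5 * b + e for a, b, e in zip(x1, x2, noise)]

rng = random.Random(0)  # fixed seed so the run is repeatable
sd = 0.1                # size of the perturbation (an arbitrary choice here)
draws = []
for _ in range(200):
    p1 = [a + rng.gauss(0.0, sd) for a in x1]
    p2 = [b + rng.gauss(0.0, sd) for b in x2]
    draws.append(fit_two_predictors(p1, p2, y))

b1_spread = statistics.stdev([d[0] for d in draws])
b2_spread = statistics.stdev([d[1] for d in draws])
print("original fit:", fit_two_predictors(x1, x2, y))
print("spread of b1 and b2 under perturbation:", b1_spread, b2_spread)
```

If tiny perturbations of the predictors produce large swings in the estimates relative to their size, the collinearity is doing real damage; if the estimates barely move, the correlation is probably tolerable for that model.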