In multiple linear regression, the correlation matrix of the predictors already indicates the strength of the correlation between any two predictors. Why do we need other tools, such as the VIF, to check for multicollinearity?
Solved – What information VIF can provide but correlation matrix cannot in detecting multicollinearity
multicollinearity, multiple-regression, regression
Related Questions
- Solved – Regression model constant causes multicollinearity warning, but not in standardized model
- Regression – Predictor Flipping Sign in Regression with No Multicollinearity Explained
- Correlation vs. VIF in OLSR Model: Understanding Multicollinearity Decision Factors
- Multicollinearity – Determinant Correlation Matrix Equals Zero
Best Answer
VIF can help identify multicollinearity, i.e. the case where one variable is strongly correlated with a linear combination (weighted sum) of several variables. This cannot necessarily be detected by looking at individual correlations. As @gung's answer to this question (which asks a few too many questions at once) says:
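To make this concrete, here is a minimal sketch (using only numpy, with made-up simulated data) of the textbook definition $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors. The third predictor is constructed as a near-linear combination of the first two, so no single pairwise correlation looks alarming, yet the VIFs are large:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# x3 is almost exactly x1 + x2: multicollinear without any extreme pairwise correlation
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

print(np.corrcoef(X, rowvar=False).round(2))       # pairwise correlations: none near 1
print([round(vif(X, j), 1) for j in range(3)])     # VIFs: all far above common cutoffs
```

Here the pairwise correlations of x1 and x2 with x3 sit around 0.7, which many analysts would not flag, while every VIF is in the tens or hundreds because each predictor is nearly a linear combination of the other two.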
Here's a particular example: suppose you are trying to model or predict an outcome from a set of compositional variables, i.e. you know the strength of someone's preference for brand A, brand B, brand C, ..., or brand Z, and these preferences add up to 1 overall (by definition/construction), and you want to use "prefers brand *" as a set of predictors in the same model. Preferences for particular pairs of brands may be either positively or negatively correlated, but the set of predictors as a whole contains only 25, not 26, pieces of information. So the correlation between $(A+B+\ldots+Y)$ and $Z$ (or between any one preference and the sum of all the other preferences) is exactly $-1$, even though the correlations of individual brand preferences with $Z$ can be all over the place.
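The compositional example above can be simulated directly. This is a hedged sketch with invented data: 6 hypothetical "brand shares" drawn from a symmetric Dirichlet distribution (rather than 26 brands), so each row sums to 1 by construction. Every pairwise correlation is modest, yet the sum of all other shares is perfectly (negatively) correlated with the last one:

```python
import numpy as np

rng = np.random.default_rng(1)
# 400 respondents, 6 brand-preference shares per row, each row summing to 1
P = rng.dirichlet(np.ones(6), size=400)

# Pairwise correlations: each is around -1/(k-1) = -0.2, individually unremarkable
C = np.corrcoef(P, rowvar=False)
print(C.round(2))

# But each share is an exact linear function of the rest:
# corr(sum of the other shares, last share) = corr(1 - Z, Z) = -1
others_sum = P[:, :-1].sum(axis=1)
print(np.corrcoef(others_sum, P[:, -1])[0, 1])  # -1.0 (up to floating-point rounding)
```

Regressing any one share on the others here gives $R^2 = 1$, so its VIF is infinite, which is exactly the multicollinearity that no entry of the pairwise correlation matrix reveals.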