Solved – Beginner question about VIF and interactions

multicollinearity, self-study, variance-inflation-factor

I am giving a "blind idiot's" try at VIF (I do not understand it; it is just quoted in passing by my course) and I notice that adding interactions seems to change the results considerably. At my level of ignorance this seems counterintuitive. Is there a simple intuitive explanation?

library(car)   # vif() is assumed to come from the car package, which matches the output below
fit_promo_disc <- lm(sales ~ discount + promo_media + promo_store, data = art_sales_df)
vif(fit_promo_disc)
#    discount promo_media promo_store 
#    1.024042    1.026158    1.003295 

fit_promo_disc <- lm(sales ~ discount * promo_media * promo_store, data = art_sales_df)
vif(fit_promo_disc)
# discount                           1.078486
# promo_media                        5.840764
# promo_store                        6.240330
# discount:promo_media               5.693507
# discount:promo_store               5.487231
# promo_media:promo_store            7.256468
# discount:promo_media:promo_store   6.472021

Is this the same as, or similar to, Multicollinearity Using VIF and Condition Indices?

Best Answer

The results make good sense. First, a bit of background. VIF stands for variance inflation factor. It measures the factor by which the variance of a regression coefficient is inflated because that independent variable is correlated with the other independent variables in the model.

The underlying issue is referred to as multicollinearity. The most useful way to define it is that one independent variable can be almost perfectly predicted (regressed) from the other independent variables. So, say a variable has a correlation of 0.95 with another. You regress that first variable on the other and get a model with R-squared = 0.95^2 ≈ 0.90. In turn, this corresponds to a Tolerance of 1 - R-squared = 1 - 0.90 = 0.10. Finally, the VIF is 1/Tolerance = 1/0.10 = 10. Such a level denotes very high multicollinearity: a VIF of 10 is almost invariably considered too high, and many social scientists use lower (more stringent) thresholds such as 5 or even 4.
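
If it helps to see that arithmetic run end to end, here is a minimal sketch in R. The data and names (x1, x2, x3, y) are simulated stand-ins, since I do not have art_sales_df; it builds two predictors with a correlation of about 0.95, computes the VIF "by hand", and checks the result against car::vif().

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)   # correlated ~0.95 with x1
x3 <- rnorm(n)
dat <- data.frame(x1, x2, x3, y = 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n))

# VIF "by hand": regress x1 on the other predictors, then take 1 / (1 - R^2)
r2_x1 <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared
1 / (1 - r2_x1)                         # roughly 10, matching the 0.95 example above

library(car)
vif(lm(y ~ x1 + x2 + x3, data = dat))   # car::vif() reports the same inflation for x1 and x2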

What does that mean in practice? When your variables suffer from a very high level of multicollinearity, their regression coefficients can become unstable. You can test that by refitting the model several times on different subsets (or resamples) of the data. With high multicollinearity, you will often see the coefficients of such variables jump around from fit to fit.
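
One hedged way to run that stability check in R, continuing with the simulated dat from the sketch above: refit the model on bootstrap resamples and look at how much each coefficient moves.

set.seed(2)
# refit on 200 bootstrap resamples of dat and collect the coefficients
boot_coefs <- t(replicate(200, {
  idx <- sample(nrow(dat), replace = TRUE)
  coef(lm(y ~ x1 + x2 + x3, data = dat[idx, ]))
}))

apply(boot_coefs, 2, sd)                 # large spreads (here for x1 and x2) flag unstable coefficients
coef(lm(y ~ x1 + x2 + x3, data = dat))   # full-data coefficients, for comparison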

Ways to deal with multicollinearity? There are some really easy ones and some really hard ones. The easy way: simply remove the one variable that is collinear with the others; problem solved. The hard way: consider a Principal Component Analysis (PCA) based model, where you regress on principal components of the predictors instead of the raw predictors. You will be ready for that one by the time you complete your PhD in statistics.
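
For the curious, the PCA route looks roughly like this in base R (a sketch only, again using the simulated dat from above): rotate the predictors into uncorrelated principal components and regress on those instead of the raw, collinear variables.

# principal components of the centered, scaled predictors are uncorrelated by construction
pcs <- prcomp(dat[, c("x1", "x2", "x3")], center = TRUE, scale. = TRUE)

# regress on the leading components; how many to keep is the genuinely hard part
pcr_fit <- lm(dat$y ~ pcs$x[, 1:2])
summary(pcr_fit)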

So, let's say your discount variable has a VIF of 2. That means the variance of its regression coefficient is 2 times larger than it would be if discount were uncorrelated with the other regressors (equivalently, its standard error is inflated by a factor of sqrt(2) ≈ 1.41). It does not mean the coefficient itself is 2 times higher.

Coming back to your actual regressions: in the simple multiple regression, your three variables discount, promo_media and promo_store are almost uncorrelated, since their respective VIFs are all very close to the minimum possible value of 1.0.

But, when you add the interaction terms, you notice that those terms and the original variables now have very high VIFs. That makes perfect sense, because each interaction term is literally the product of your original variables, so by construction it is strongly correlated with them. And, most of the VIFs are above 5, the level commonly deemed problematic.
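
You can see that "by construction" correlation with a tiny simulation. The names (disc, media) are made up, not your columns, but the pattern is the same as in your output: the product term is strongly correlated with its components, and the VIFs jump once it enters the model.

set.seed(4)
disc  <- runif(200, 0, 30)      # a made-up percentage discount
media <- rbinom(200, 1, 0.4)    # a made-up 0/1 promo indicator, independent of disc
cor(media, disc * media)        # high by construction, even though disc and media are independent

sales <- 100 + 2 * disc + 10 * media + 0.5 * disc * media + rnorm(200, sd = 5)
library(car)
vif(lm(sales ~ disc + media))   # both close to 1
vif(lm(sales ~ disc * media))   # much larger once the disc:media product enters the model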

One thing to consider... is that the way your interaction variables are constructed, you will inevitably get very high VIFs. That does not necessarily mean you have to immediately junk the whole model. If you want to keep the entire model, I would check three things: 1) are the regression coefficients reasonably stable when you fit the model on different cuts of the data; 2) do the regression coefficients have the expected signs; and 3) are the coefficients statistically significant. If you can answer a clear "yes" on all three counts, you can argue that your model is still sound and robust. Be warned that many model reviewers (your professor?) may not agree with my rationale, though some of them, on a good day when they are in a good mood, probably will. In quantitative modeling there are plenty of debates without a clear finish line, and multicollinearity is one of them.

To investigate further, make sure that the model with interaction terms is truly better than the simpler model. Test them on a hold-out sample to see which one predicts better on new data. Also do the same comparison with models containing just one interaction term, then two, then three. Most probably you do not need all of the interaction terms to get decent predictions.
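
A bare-bones version of that hold-out comparison, written against your data frame as posted (art_sales_df and the column names come from your question; the 80/20 split and the RMSE metric are my own assumptions):

set.seed(5)
train_idx <- sample(nrow(art_sales_df), size = floor(0.8 * nrow(art_sales_df)))
train <- art_sales_df[train_idx, ]
test  <- art_sales_df[-train_idx, ]

m_main <- lm(sales ~ discount + promo_media + promo_store, data = train)
m_int  <- lm(sales ~ discount * promo_media * promo_store, data = train)

rmse <- function(model, newdata) {
  sqrt(mean((newdata$sales - predict(model, newdata = newdata))^2))
}

rmse(m_main, test)   # the model with the lower hold-out RMSE predicts better on new data
rmse(m_int,  test)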

Before you ask any further questions on multicollinearity, I suggest you study the related Wikipedia article in detail. This platform is not the best place to provide textbook-length material, with follow-up comments limited to 400 characters.