VIF Analysis – Understanding Why VIF Drops When Some Dummy Variables Are Deleted

categorical-encoding, least-squares, multicollinearity, python, variance-inflation-factor

Is my model valid even with the high VIF? Does it matter which dummy variable I drop as the reference point?

I have a categorical variable (Fruit) that I converted to dummy variables: columns Apple, Banana, Cranberry, Durian, etc. I deleted the Apple dummy column so that it acts as the baseline.
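Roughly, the encoding step looks like this (simplified, with illustrative column names):

import pandas as pd

# One-hot encode Fruit; dropping the Apple column makes Apple the reference level
dummies = pd.get_dummies(df['Fruit'], prefix='Fruit').drop(columns=['Fruit_Apple'])
df = pd.concat([df.drop(columns=['Fruit']), dummies], axis=1)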

When I run an Ordinary Least Squares (OLS) model, the VIF is NaN; it drops to 16 once the Banana dummy is also removed, and drops further to 4 when I delete the Cranberry dummy as well.

I want to avoid multicollinearity, but I thought dummy variables had to be kept together, not cherry-picked.

I also ran a correlation matrix on the dummy variables, and none of them exhibited a correlation higher than 0.2.
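The correlation check was along these lines (a sketch; the column prefix is illustrative):

# Pairwise correlations among the dummy columns; none exceed 0.2 here
dummy_cols = [c for c in df.columns if c.startswith('Fruit_')]
print(df[dummy_cols].corr().round(2))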

Example code for VIF:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import numpy as np

# Design matrix (including the constant) from the fitted OLS model
df_mc_features = model_mc.model.exog

# VIF for each column of the design matrix
mc_vif = [variance_inflation_factor(df_mc_features, i) for i in range(df_mc_features.shape[1])]

print('Median VIF:', np.median(mc_vif))
print('Average VIF:', np.mean(mc_vif))

Example code for OLS:

import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

# Standardize every column (the target and all predictors, including the dummies)
scaler = StandardScaler()
data = df.copy()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

df_mc_y = data_scaled['Target Variable'].copy()
df_mc_x = data_scaled.drop(['Target Variable'], axis=1).copy()

# Fit OLS with an intercept; keeping y as a Series preserves the variable name in the summary
model_mc = sm.OLS(df_mc_y.astype(float),
                  sm.add_constant(df_mc_x.astype(float)), missing='drop').fit()

model_mc.summary()

Best Answer

Multicollinearity is likely to occur with dummy variables, and for the most part it just does not matter. See Paul Allison's discussion of this here: https://statisticalhorizons.com/multicollinearity

Yes, the standard errors might be large, but that does not affect the $F$-test for the factor as a whole. One thing you can do is switch the baseline group around until you get the smallest VIFs across the board; choosing the level with the largest sample size as the reference should give the best results.
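For example, a minimal sketch of trying each level as the baseline and comparing the worst VIF (it assumes the original DataFrame df still has the raw Fruit column; the names are illustrative):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def max_vif_with_baseline(fruit, baseline):
    # One-hot encode the factor, drop `baseline` as the reference level, add a constant
    dummies = pd.get_dummies(fruit, prefix='Fruit').drop(columns=[f'Fruit_{baseline}'])
    X = sm.add_constant(dummies.astype(float))
    # Largest VIF among the dummy columns (position 0 is the constant, so skip it)
    return max(variance_inflation_factor(X.values, i) for i in range(1, X.shape[1]))

# Try each level as the reference; the most frequent level tends to give the smallest VIFs
for level in sorted(df['Fruit'].unique()):
    print(level, round(max_vif_with_baseline(df['Fruit'], level), 1))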

A small correlation between dummies is not inconsistent with large VIF values. Imagine you have seven levels of the factor. Each dummy may be weakly correlated with any other single dummy, but knowledge of five of the seven dummies will give you good predictions for either of the remaining two; hence the high VIF.
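A quick simulation makes this concrete (the seven levels and their frequencies below are invented; the rare baseline mimics an unlucky choice of reference category):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)

# Seven-level factor in which the baseline level 'A' is deliberately rare
levels = list('ABCDEFG')
fruit = pd.Series(rng.choice(levels, size=2000, p=[0.02] + [0.98 / 6] * 6))

# Drop 'A' as the reference category and add a constant
X = sm.add_constant(pd.get_dummies(fruit, drop_first=True).astype(float))

# Pairwise correlations among the dummies stay modest...
print(X.drop(columns='const').corr().round(2))

# ...yet each dummy is well predicted by the others jointly, so the VIFs come out far larger
print([round(variance_inflation_factor(X.values, i), 1) for i in range(1, X.shape[1])])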

Note that VIF by itself is not a good reason for dropping dummy variables.