The multicollinearity problem is well studied in most econometric textbooks, and the Wikipedia article on multicollinearity summarizes most of the key issues well.
In practice, one starts to worry about multicollinearity when it produces visible signs of parameter instability (most of which stem from poor invertibility of the $X^TX$ matrix):
- large changes in parameter estimates while performing rolling regressions or estimates on smaller sub-samples of the data
- parameter estimates that individually fail to be significant (by $t$ tests) even though the regression's $F$ test shows high joint significance of the results
- the VIF statistic (based on the $R^2$ values of auxiliary regressions) depends on the tolerance level you require: most practical suggestions put an acceptable tolerance below 0.2 or 0.1, meaning the corresponding auxiliary-regression $R^2$ should be higher than 0.8 or 0.9 to flag the problem, i.e. VIF above the rule-of-thumb values of 5 or 10 (the computation is sketched below). In small samples (fewer than 50 points) 5 is preferable; in larger samples you can allow larger values.
- the condition index is an alternative to VIF. In your case neither VIF nor the condition index indicates a remaining problem, so you may be satisfied statistically with this result, but...
probably not theoretically, since it may happen (and often is the case) that you need all the variables to be present in the model. Excluding relevant variables (the omitted-variable problem) makes the parameter estimates biased and inconsistent anyway. On the other hand, you may be forced to include all focus variables simply because your analysis is built on them. In a data-mining approach, though, you are more technical in searching for the best fit.
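As a minimal sketch of the VIF computation mentioned above (the simulated data and the names x1, x2 are purely illustrative; vif() comes from the car package):

library(car)                               # provides vif()
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.9*x1 + sqrt(1 - 0.9^2)*rnorm(n)    # x2 strongly correlated with x1
y  <- 1 + x1 + x2 + rnorm(n)

aux_r2 <- summary(lm(x1 ~ x2))$r.squared   # auxiliary regression R^2
1/(1 - aux_r2)                             # VIF by hand: 1/tolerance = 1/(1 - R^2)
vif(lm(y ~ x1 + x2))                       # same values from car::vif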
So keep in mind the alternatives (that I would use myself):
- obtain more data points (recall that the VIF threshold can be larger for bigger data sets, and slowly varying explanatory variables may start to change at some crucial points in time or in the cross-section)
- search for latent factors through principal components (the components are orthogonal combinations, hence not multicollinear by construction; moreover, they involve all explanatory variables)
- ridge regression (it introduces a small bias in the parameter estimates but makes them much more stable); both alternatives are sketched below
Some other tricks are in the wiki article noted above.
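A minimal sketch of the last two alternatives on simulated data (the variable names are illustrative; lm.ridge() comes from the MASS package shipped with R):

library(MASS)                          # provides lm.ridge()
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd=0.1)            # nearly collinear with x1
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)
X  <- cbind(x1, x2, x3)

pc <- prcomp(X, scale.=TRUE)           # principal components: orthogonal by construction
summary(lm(y ~ pc$x[, 1:2]))           # regress on the first two components

lm.ridge(y ~ x1 + x2 + x3, lambda=1)   # ridge: small bias, much more stable estimates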
I cannot reproduce exactly this phenomenon, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between numbers of categories and multicollinearity.
Here is an R function to create categorical datasets with a specifiable number of categories (for two independent variables) and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:
library(car)  # for vif()

trial <- function(n, k1=2, k2=2) {
  # balanced design: every combination of the k1 x k2 categories appears n times
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)        # response unrelated to the design
  fit <- lm(y ~ Var1+Var2, data=df)
  vif(fit)
}
Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:
sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 30 categories, 1-5 replicates
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line
df <- subset(df, subset=(y < 0))
before the fit line in trial. This removes half the data at random. Re-running

sapply(1:5, function(i) trial(i, 10, 3))

shows that the VIFs are no longer equal to $1$ (but they remain close to it, randomly). They still do not increase with more categories:

sapply(1:5, function(i) trial(i, 10, 10))

produces comparable values.
Best Answer
The results make good sense. Just a bit of background: VIF stands for variance inflation factor, and it measures the multiple by which a regression coefficient's variance is inflated because that independent variable is correlated with the other variables.
The above issue is referred to as multicollinearity. And the most precise statement of the problem is that one independent variable can be almost perfectly estimated, or regressed, using the other independent variables. So, let's say a variable has a correlation of 0.95 with another. You regress the first variable on the other, and you derive a model with R Square = 0.95^2 = 0.90. In turn, this corresponds to a Tolerance of 1 - R Square = 1 - 0.90 = 0.10. And, finally, you can derive the VIF, which is equal to 1/Tolerance = 1/0.10 = 10. Such a level of VIF denotes a very high degree of multicollinearity. A VIF of 10 is almost invariably considered too high, and many social scientists use lower (more stringent) thresholds such as a VIF of 5 or even 4.
What does that mean? When your variables suffer from a very high level of multicollinearity, their regression coefficients can become unstable. You can test that by rerunning your model several times, truncating the data each time to capture a different segment. With high multicollinearity, you will often see the regression coefficients of such variables shift markedly.
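A minimal sketch of this stability check on simulated data (the names x1, x2 are illustrative, not taken from your model):

set.seed(3)
n   <- 300
x1  <- rnorm(n)
x2  <- x1 + rnorm(n, sd=0.05)      # severe multicollinearity
dat <- data.frame(x1, x2, y = 1 + x1 + x2 + rnorm(n))

# refit on 20 random halves of the data and look at the spread of each coefficient
coefs <- t(sapply(1:20, function(i) coef(lm(y ~ x1 + x2, data=dat[sample(n, n/2), ]))))
apply(coefs, 2, sd)                # large spread on x1 and x2 signals instability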
Ways to solve multicollinearity? There are some really easy ones and some really hard ones. The easy way is simply to remove the one variable that is multicollinear with the others. Problem solved. The hard way is to use a Principal Component Analysis (PCA) model. You will be ready for that one by the time you complete your PhD in statistics.
So, let's say your discount variable has a VIF of 2. It means the variance of this variable's regression coefficient is 2 times higher in your regression than it would be if discount were uncorrelated with the other regressors.
So, within your actual regression, when you use a simple multiple regression your three variables (discount, promo_media, promo_store) are almost uncorrelated, since their respective VIFs are all very close to the minimum value of 1.0.
But, when you add the interaction terms, you notice that all of those variables have very high VIFs, and so do the original ones. That makes perfect sense, since the interaction terms are constructed from your original three variables. So, by definition, all of those variables will be very highly correlated. And all the VIFs are above 5, which is commonly deemed problematic.
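A minimal sketch of this effect on simulated stand-ins for your variables (the data are made up; vif() is from the car package). The variables are given a non-zero mean, so each product term is strongly correlated with its components:

library(car)
set.seed(4)
n <- 200
d <- data.frame(discount    = rnorm(n, mean=5),
                promo_media = rnorm(n, mean=5),
                promo_store = rnorm(n, mean=5))
d$sales <- with(d, 2 + discount + promo_media + promo_store + rnorm(n))

vif(lm(sales ~ discount + promo_media + promo_store, data=d))  # all near 1
vif(lm(sales ~ discount*promo_media + promo_store, data=d))    # interaction inflates VIFs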
One thing to consider is that, given the way your interaction variables are structured, you will inevitably get very high VIFs. That does not necessarily mean you have to junk the whole model immediately. If you want to keep the entire model, I would check three things: 1) are the regression coefficients reasonably stable when you use different cuts of the data; 2) do the regression coefficients have the appropriate directional sign; and 3) are all the regression coefficients statistically significant. If you can answer a clear "yes" on all three counts, you could argue that your model is still sound and robust. Be warned that many model peer-reviewers (your professor?) may not necessarily agree with my rationale. But some of them, on a good day when they are in a good mood, probably will. In quantitative modeling there are a ton of debates without clear finish lines, and the multicollinearity issue is one of them.
To investigate your model further, make sure the model with interaction variables is truly better than the simpler one. Test both on a hold-out sample to see which predicts better on new data. Also run the same comparison with models using just one interaction term, then two, then three. You most probably do not need all three interaction variables to generate decent model predictions.
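A minimal sketch of such a hold-out comparison, reusing the hypothetical data frame d from the sketch above:

set.seed(5)
train <- sample(nrow(d), 0.7*nrow(d))
m_simple <- lm(sales ~ discount + promo_media + promo_store, data=d[train, ])
m_inter  <- lm(sales ~ discount*promo_media + promo_store,  data=d[train, ])

# root-mean-squared prediction error on the held-out rows
rmse <- function(m) sqrt(mean((d$sales[-train] - predict(m, newdata=d[-train, ]))^2))
c(simple = rmse(m_simple), interaction = rmse(m_inter))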
Before you ask any further questions on multicollinearity, I suggest you study the related Wikipedia article in detail. This platform is not the best place to provide textbook-length material with follow-up comments limited to 400 characters.