I cannot reproduce exactly this phenomenon, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs, so in general there need be no relationship between the number of categories and multicollinearity.
Here is an R function that creates categorical datasets with a specifiable number of categories for each of two independent variables and a specifiable amount of replication per category. It represents a balanced study in which every combination of categories is observed the same number of times, $n$:
```r
library(car)  # for vif()

trial <- function(n, k1 = 2, k2 = 2) {
  # All k1 * k2 combinations of the two category labels
  df <- expand.grid(1:k1, 1:k2)
  # Replicate each combination n times: a balanced design
  df <- do.call(rbind, lapply(1:n, function(i) df))
  # The response is pure noise; the predictors have no real effect
  df$y <- rnorm(k1 * k2 * n)
  fit <- lm(y ~ Var1 + Var2, data = df)
  vif(fit)
}
```
Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:
```r
sapply(1:5, trial)                        # two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 10 x 3 = 30 category combinations, 1-5 replicates
```
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line

```r
df <- subset(df, subset = (y < 0))
```

before the `fit <- lm(...)` line in `trial`. This removes roughly half the data at random. Re-running

```r
sapply(1:5, function(i) trial(i, 10, 3))
```

shows that the VIFs are no longer equal to $1$ (but they remain close to it, varying randomly). They still do not increase with more categories:

```r
sapply(1:5, function(i) trial(i, 10, 10))
```

produces comparable values.
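Putting the modification together, here is a sketch of the modified function (`trial_unbalanced` is my name for it, not one used above; it assumes the `car` package as before):

```r
library(car)  # for vif()

# Like trial(), but discards roughly half the rows at random,
# producing an unbalanced design whose VIFs exceed 1.
trial_unbalanced <- function(n, k1 = 2, k2 = 2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1 * k2 * n)
  df <- subset(df, subset = (y < 0))  # remove about half the data at random
  fit <- lm(y ~ Var1 + Var2, data = df)
  vif(fit)
}

sapply(1:5, function(i) trial_unbalanced(i + 5, 10, 3))
```

The exact values vary from run to run, but every VIF is at least $1$ by construction, and they hover near $1$ rather than growing with the number of categories.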
Neither VIFs nor stepwise selection tell you which variables are dependent on which. For that, you want condition indices. In R you can get these from the `perturb` package using the `colldiag` function.
There, you first look for condition indices that are high (some suggest $>10$, others $>30$). Then, for each high index, you look at the variables that contribute a large proportion of variance.
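For intuition, the condition indices themselves are easy to compute by hand. This is only a sketch of the underlying idea, not the `perturb` implementation (which offers options for centering and scaling): rescale each column of the design matrix to unit length, take the singular values, and divide the largest by each.

```r
# Condition indices by hand (a sketch of what colldiag() computes).
set.seed(1)
X <- model.matrix(~ x1 + x2, data.frame(x1 = rnorm(20), x2 = rnorm(20)))
Xs <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))  # unit-length columns
d <- svd(Xs)$d       # singular values, in decreasing order
ci <- max(d) / d     # condition indices; the largest is the condition number
ci
```

The first index is always $1$; near-dependencies among the columns show up as large values at the end of the vector.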
EDIT to clarify (adapted from the `colldiag` documentation):

```r
library(perturb)

data(consumption)
ct1 <- with(consumption, c(NA, cons[-length(cons)]))  # lagged consumption
m1 <- lm(cons ~ ct1 + dpi + rate + d_dpi, data = consumption)
cd <- colldiag(m1)
cd
```
This gives:
```
  Condition
  Index    Variance Decomposition Proportions
           intercept ct1   dpi   rate  d_dpi
1   1.000  0.001     0.000 0.000 0.000 0.002
2   4.143  0.004     0.000 0.000 0.001 0.136
3   7.799  0.310     0.000 0.000 0.013 0.001
4  39.406  0.263     0.005 0.005 0.984 0.048
5 375.614  0.421     0.995 0.995 0.001 0.814
```
Proportions below a cutoff can be suppressed to make the pattern easier to read:

```r
print(cd, fuzz = .3)
```

```
  Condition
  Index    Variance Decomposition Proportions
           intercept ct1   dpi   rate  d_dpi
1   1.000  .         .     .     .     .
2   4.143  .         .     .     .     .
3   7.799  0.310     .     .     .     .
4  39.406  .         .     .     0.984 .
5 375.614  0.421     0.995 0.995 .     0.814
```
The first column is just a row identifier. The second is the condition index. The remaining columns are the variance-decomposition proportions for each coefficient.
The bottom row shows clearly problematic collinearity ($375 \gg 30$). So which variables are contributing? `ct1`, `dpi`, and `d_dpi` all have high variance-decomposition proportions on that row, so all three are involved, and you need to do something about it.
The fourth row also has a problematic condition index (39), but only one variable (`rate`) contributes a large proportion of variance; a near-dependency requires at least two variables with high proportions, so there is nothing to act on there.
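As one illustration of "doing something" (a sketch only; whether dropping `dpi` is sensible is a subject-matter question, not a statistical one), you could refit without one of the implicated variables and re-run the diagnostics:

```r
library(perturb)

data(consumption)
ct1 <- with(consumption, c(NA, cons[-length(cons)]))  # lagged consumption, as above

# Refit without dpi, one of the three variables loading on the
# worst condition index, then re-check the condition indices.
m2 <- lm(cons ~ ct1 + rate + d_dpi, data = consumption)
colldiag(m2)
```

Other standard options include combining the near-redundant variables into a single index or using a penalized fit; the diagnostics only tell you where the dependency lives, not which remedy is right.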
Best Answer
Collinearity problems with interactions are common. Not only are interactions collinear with other interactions, they are often collinear with main effects and with omitted main effects. There is very little that can or should be done about this. Sometimes a variable clustering analysis can help you understand the problem. The bottom line: assessing interactions is a difficult problem due to lack of precision and power. Interactions are probably the most important aspect of the model to pre-specify using subject-matter considerations.
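As a sketch of the variable-clustering idea (this assumes the `Hmisc` package; the variables here are simulated placeholders, not data from the question):

```r
library(Hmisc)  # for varclus()

# Simulate three predictors, two of them nearly collinear.
set.seed(2)
d <- data.frame(x1 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.1)  # nearly a copy of x1
d$x3 <- rnorm(100)

# varclus() clusters variables by pairwise similarity; variables that
# join at high similarity are near-redundant candidates for combining.
vc <- varclus(~ x1 + x2 + x3, data = d)
plot(vc)
```

In the resulting dendrogram the near-redundant pair joins at high similarity while the independent variable stands apart, which is often enough to see where a collinearity problem lives.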