I cannot reproduce exactly this phenomenon, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs, so in general there need be no relationship between the number of categories and multicollinearity.
Here is an R function that creates categorical datasets with a specifiable number of categories for each of two independent variables and a specifiable amount of replication per category. It represents a balanced study in which every combination of categories is observed the same number of times, $n$:
```r
library(car)  # for vif()

trial <- function(n, k1 = 2, k2 = 2) {
  # All k1 * k2 combinations of the two category labels
  df <- expand.grid(1:k1, 1:k2)
  # Replicate each combination n times: a balanced design
  df <- do.call(rbind, lapply(1:n, function(i) df))
  # The response is pure noise; the predictors have no real effect
  df$y <- rnorm(k1 * k2 * n)
  fit <- lm(y ~ Var1 + Var2, data = df)
  vif(fit)
}
```
Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:
```r
sapply(1:5, trial)                        # two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 10 x 3 = 30 category combinations, 1-5 replicates
```
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line

```r
df <- subset(df, subset = (y < 0))
```

before the `fit <- lm(...)` line in `trial`. This removes roughly half the data at random. Re-running

```r
sapply(1:5, function(i) trial(i, 10, 3))
```

shows that the VIFs are no longer equal to $1$ (but they remain close to it, varying randomly). They still do not increase with more categories:

```r
sapply(1:5, function(i) trial(i, 10, 10))
```

produces comparable values.
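Putting the modification together, here is a sketch of the modified function (`trial_unbalanced` is my name for it, not one used above; it assumes the `car` package as before):

```r
library(car)  # for vif()

# Like trial(), but discards roughly half the rows at random,
# producing an unbalanced design whose VIFs exceed 1.
trial_unbalanced <- function(n, k1 = 2, k2 = 2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1 * k2 * n)
  df <- subset(df, subset = (y < 0))  # remove about half the data at random
  fit <- lm(y ~ Var1 + Var2, data = df)
  vif(fit)
}

sapply(1:5, function(i) trial_unbalanced(i + 5, 10, 3))
```

The exact values vary from run to run, but every VIF is at least $1$ by construction, and they hover near $1$ rather than growing with the number of categories.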
Neither VIFs nor stepwise selection tell you which variables are dependent on which. For that, you want condition indices. In R you can get these from the `perturb` package using the `colldiag` function.
There, you first look for condition indices that are high (some suggest $>10$, others $>30$). Then, for each high index, you look at the variables that contribute a large proportion of variance.
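For intuition, the condition indices themselves are easy to compute by hand. This is only a sketch of the underlying idea, not the `perturb` implementation (which offers options for centering and scaling): rescale each column of the design matrix to unit length, take the singular values, and divide the largest by each.

```r
# Condition indices by hand (a sketch of what colldiag() computes).
set.seed(1)
X <- model.matrix(~ x1 + x2, data.frame(x1 = rnorm(20), x2 = rnorm(20)))
Xs <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))  # unit-length columns
d <- svd(Xs)$d       # singular values, in decreasing order
ci <- max(d) / d     # condition indices; the largest is the condition number
ci
```

The first index is always $1$; near-dependencies among the columns show up as large values at the end of the vector.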
EDIT to clarify (adapted from the `colldiag` documentation):

```r
library(perturb)

data(consumption)
ct1 <- with(consumption, c(NA, cons[-length(cons)]))  # lagged consumption
m1 <- lm(cons ~ ct1 + dpi + rate + d_dpi, data = consumption)
cd <- colldiag(m1)
cd
```
This gives:
```
  Condition
  Index    Variance Decomposition Proportions
           intercept ct1   dpi   rate  d_dpi
1   1.000  0.001     0.000 0.000 0.000 0.002
2   4.143  0.004     0.000 0.000 0.001 0.136
3   7.799  0.310     0.000 0.000 0.013 0.001
4  39.406  0.263     0.005 0.005 0.984 0.048
5 375.614  0.421     0.995 0.995 0.001 0.814
```
Proportions below a cutoff can be suppressed to make the pattern easier to read:

```r
print(cd, fuzz = .3)
```

```
  Condition
  Index    Variance Decomposition Proportions
           intercept ct1   dpi   rate  d_dpi
1   1.000  .         .     .     .     .
2   4.143  .         .     .     .     .
3   7.799  0.310     .     .     .     .
4  39.406  .         .     .     0.984 .
5 375.614  0.421     0.995 0.995 .     0.814
```
The first column is just a row identifier. The second is the condition index. The remaining columns are the variance-decomposition proportions for each coefficient.
The bottom row shows clearly problematic collinearity ($375 \gg 30$). So which variables are contributing? `ct1`, `dpi`, and `d_dpi` all have high variance-decomposition proportions on that row, so all three are involved, and you need to do something about it.
The fourth row also has a problematic condition index (39), but only one variable (`rate`) contributes a large proportion of variance; a near-dependency requires at least two variables with high proportions, so there is nothing to act on there.
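As one illustration of "doing something" (a sketch only; whether dropping `dpi` is sensible is a subject-matter question, not a statistical one), you could refit without one of the implicated variables and re-run the diagnostics:

```r
library(perturb)

data(consumption)
ct1 <- with(consumption, c(NA, cons[-length(cons)]))  # lagged consumption, as above

# Refit without dpi, one of the three variables loading on the
# worst condition index, then re-check the condition indices.
m2 <- lm(cons ~ ct1 + rate + d_dpi, data = consumption)
colldiag(m2)
```

Other standard options include combining the near-redundant variables into a single index or using a penalized fit; the diagnostics only tell you where the dependency lives, not which remedy is right.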
Best Answer
Collinearity problems with interactions are common. Not only are interactions collinear with other interactions, they are often collinear with main effects and with omitted main effects. There is very little that can or should be done about this. Sometimes a variable clustering analysis can help you understand the problem. The bottom line: assessing interactions is a difficult problem due to lack of precision and power. Interactions are probably the most important aspect of the model to pre-specify using subject-matter considerations.
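As a sketch of the variable-clustering idea (this assumes the `Hmisc` package; the variables here are simulated placeholders, not data from the question):

```r
library(Hmisc)  # for varclus()

# Simulate three predictors, two of them nearly collinear.
set.seed(2)
d <- data.frame(x1 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.1)  # nearly a copy of x1
d$x3 <- rnorm(100)

# varclus() clusters variables by pairwise similarity; variables that
# join at high similarity are near-redundant candidates for combining.
vc <- varclus(~ x1 + x2 + x3, data = d)
plot(vc)
```

In the resulting dendrogram the near-redundant pair joins at high similarity while the independent variable stands apart, which is often enough to see where a collinearity problem lives.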