Solved – Multicollinearity between two categorical variables

multicollinearityr

Is Variance inflation factor(VIF) also applicable in order to test multicollinearity in between two categorical variables? What is the use of the Spearman test? How to do this on R?

Best Answer

Generalized VIF is your friend. See example:

data1<-data.frame(
  y = rnorm(8),
  x1 = factor(LETTERS[c(1,1,1,2,2,2,3,3)]),
  x2 = factor(letters[c(1,1,2,1,2,3,2,3)]),
  x3 = factor(rep(c('one','two'),4))
)

data1

            y x1 x2  x3
1 -1.20757109  A  a one
2 -0.92517490  A  a two
3 -1.97064426  A  b one
4  0.91072507  B  a two
5  0.82909639  B  b one
6  0.04714072  B  c two
7 -1.00678648  C  b one
8 -0.08177810  C  c two

library(car)

vif(lm(y~x1+x2+x3, data=data1))

       GVIF Df GVIF^(1/(2*Df))
x1 1.800000  2        1.158292
x2 4.950000  2        1.491596
x3 3.466667  1        1.861899

And read about GVIF in ?vif:

If all terms in an unweighted linear model have 1 df, then the usual variance-inflation factors are calculated.

If any terms in an unweighted linear model have more than 1 df, then generalized variance-inflation factors (Fox and Monette, 1992) are calculated. These are interpretable as the inflation in size of the confidence ellipse or ellipsoid for the coefficients of the term in comparison with what would be obtained for orthogonal data.

The generalized vifs are invariant with respect to the coding of the terms in the model (as long as the subspace of the columns of the model matrix pertaining to each term is invariant). To adjust for the dimension of the confidence ellipsoid, the function also prints GVIF^[1/(2*df)] where df is the degrees of freedom associated with the term.

Related Solutions

Solved – Is multicollinearity implicit in categorical variables

I cannot reproduce exactly this phenomenon, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.

The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between numbers of categories and multicollinearity.

Here is an R function to create categorical datasets with specifiable numbers of categories (for two independent variables) and specifiable amount of replication for each category. It represents a balanced study in which every combination of category is observed an equal number of times, $n$:

trial <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)
  fit <- lm(y ~ Var1+Var2, data=df)
  vif(fit)
}

Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:

sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 30 categories, 1-5 replicates

This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line

  df <- subset(df, subset=(y < 0))

before the fit line in trial. This removes half the data at random. Re-running

sapply(1:5, function(i) trial(i, 10, 3))

shows that the VIFs are no longer equal to $1$ (but they remain close to it, randomly). They still do not increase with more categories: sapply(1:5, function(i) trial(i, 10, 10)) produces comparable values.

Solved – How to test for multicollinearity among dumthe explanatory variables

The VIF is probably the best way to go here. The Pearson correlation will give you a lousy measure here because it behaves somewhat weirdly for categorical variables like this. Another possibility is to use a matrix of a different measure like cosine similarity: $\sum x_i*x_j / \sqrt{\sum x_i^2 * \sum x_j^2}$. I think that is equivalent to Spearman's Rho or Kendall's Tau but am not sure off the top of my head.

I'd stick to the VIF though because it will tell you for each variable whether the other variables combined are highly colinear. But if you want a visual diagnostic of which pairwise variables are similar, those other metrics are better than Pearson for categorical data.

----EDIT---

Sure. This has to do primarily with the fact that Pearson's correlation can swing up or down or go negative very easily. Here's an example:

> cor(c(0,1,1,1,0,1,0,1,0),c(1,1,0,1,1,0,1,1,0))
[1] -0.1581139
> cor(c(0,1,1,1,0,1,0,1,0),c(0,1,0,1,1,0,1,1,0))
[1] 0.1

Here, by changing just one of the entries to zero we have swung the correlation from positive to negative. But the VIF uses $1/(1-R_{i}^2)$ where the $R_{i}^2$ is for the regression of the other variables on the one in question. I would have to work it out but I think that is basically a linear combination of something similar to the cosine measure I posted above, or a transform of it. Essentially though, it can't go negative.

I don't know any literature on it off the top of my head, but I will think about it.

Best Answer

Related Solutions

Solved – Is multicollinearity implicit in categorical variables

Solved – How to test for multicollinearity among dumthe explanatory variables

Related Question