If you are using R, SPSS or Stata, you can look at the perturb package. It diagnoses collinearity by adding random noise to continuous variables; categorical variables are randomly reassigned to different categories. The R documentation for perturb notes that the model need not be lm, implying that any model (including ones built with optimal scaling or ordinal logistic regression) could be used.
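A minimal sketch of a perturb run (the pvars/prange arguments follow my reading of the package documentation; check ?perturb before relying on them):

library(perturb)
# fit any model -- the documentation's point is that it need not be lm
m <- lm(mpg ~ disp + hp + wt, data = mtcars)
# refit repeatedly, adding random noise of magnitude 1 to each continuous predictor
p <- perturb(m, pvars = c("disp", "hp", "wt"), prange = c(1, 1, 1))
summary(p)   # summarizes how much the coefficients move across the refits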
I'm glad you like my answer :-)
It's not that there is no valid method of detecting collinearity in logistic regression: since collinearity is a relationship among the independent variables, the dependent variable doesn't matter.
What is problematic is figuring out how much collinearity is too much for logistic regression. David Belsley did extensive work with condition indexes. He found that indexes over 30, with substantial variance accounted for in more than one variable, were indicative of collinearity that would cause severe problems in OLS regression. However, "severe" is always a judgment call. Perhaps the easiest way to see the problems of collinearity is to show that small changes in the data make big changes in the results.
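As a toy illustration of the diagnostic (my own sketch, not Belsley's worked example): scale the columns of the model matrix to unit length and take the ratio of the largest singular value to each singular value; near-duplicate predictors push the largest index well past 30.

set.seed(1)
a <- rnorm(50)
b <- a + rnorm(50, 0, .01)        # b nearly duplicates a
X <- cbind(1, a, b)               # model matrix, intercept included
Xs <- apply(X, 2, function(v) v / sqrt(sum(v^2)))   # unit-length columns
d <- svd(Xs)$d
d[1] / d                          # condition indexes; the largest is far above 30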
[This paper](http://www.medicine.mcgill.ca/epidemiology/joseph/courses/epib-621/logconfound.pdf) offers examples of collinearity in logistic regression. It even shows that R detects exact collinearity and that, in fact, some cases of approximate collinearity will cause the same warning:
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
Nevertheless, we can ignore this warning and run:
set.seed(1234)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2 + rnorm(100, 0, 1)   # x3 is approximately collinear with x1 and x2
y <- x1 + 2*x2 + 3*x3 + rnorm(100)
ylog <- cut(y, 2, c(1, 0))         # dichotomize y into a binary factor
m1 <- glm(ylog ~ x1 + x2 + x3, family = binomial)
coef(m1)
which yields coefficients of -2.55, 1.97, 5.60 and 12.54 (intercept first).
We can then slightly perturb x1 and x2, add them for a new x3 and run again:
x1a <- x1 + rnorm(100, 0, .01)   # x1 with a tiny amount of added noise
x2a <- x2 + rnorm(100, 0, .01)   # likewise for x2
x3a <- x1a + x2a + rnorm(100, 0, 1)
ya <- x1a + 2*x2a + 3*x3a + rnorm(100)   # ya and yloga are constructed but not
yloga <- cut(ya, 2, c(1, 0))             # used below: the original response ylog
                                         # is kept so only the predictors change
m2 <- glm(ylog ~ x1a + x2a + x3a, family = binomial)
coef(m2)
This yields wildly different coefficients: 0.003, 3.012, 3.51 and -0.41.
And yet this set of independent variables does not have a high condition index:
library(perturb)
colldiag(m1)
says the maximum condition index is 3.54.
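If you prefer variance inflation factors, a quick cross-check is possible (assuming the car package is installed; I have not verified the exact values it returns for this model):

library(car)
vif(m1)   # compare against the usual rule-of-thumb threshold of 10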
I am not aware of any Monte Carlo studies of this; if there are none, it seems a good area for research.
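One way such a study might start (a sketch of mine building on the simulation above; the number of replications and the perturbation size are arbitrary choices):

B <- 500
coefs <- replicate(B, {
  x1b <- x1 + rnorm(100, 0, .01)          # perturb the predictors as before
  x2b <- x2 + rnorm(100, 0, .01)
  x3b <- x1b + x2b + rnorm(100, 0, 1)
  coef(glm(ylog ~ x1b + x2b + x3b, family = binomial))
})
apply(coefs, 1, sd)   # large spreads flag instability the condition index missed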
You could transform your categorical variables into one-hot encoded dummy variables before doing what @Adam-Quek suggested; once everything is numeric, the regular tools (feature correlation, PCA, ...) apply again:
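A minimal sketch in base R (the data frame and variable names here are made up for illustration):

df <- data.frame(x = rnorm(20),
                 g = factor(sample(c("a", "b", "c"), 20, replace = TRUE)))
X <- model.matrix(~ x + g, data = df)[, -1]   # 0/1 dummy columns, intercept dropped
cor(X)                      # feature correlations, now defined for every column
prcomp(X, scale. = TRUE)    # PCA on the one-hot encoded design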