Solved – Binary Logistic Regression Multicollinearity Tests

logistic, multicollinearity, regression, variance-inflation-factor

I like Peter Flom's answer to an earlier question about multicollinearity in logistic regression, but David Garson's Logistic Binomial Regression states that there is no valid test for multicollinearity in logistic regression with a binary dependent variable, even if the independent variables are ratio scale. Can anyone supply one or more references? In my own experience, OLS correlation matrices and VIF worked: my logistic coefficients went haywire until I removed the entangled independent variables flagged by those OLS-based multicollinearity checks. But I have to publish my results and methods, and would like a reputable reference for the practice, if one or more exist.

Best Answer

I'm glad you like my answer :-)

It's not that there is no valid method of detecting collinearity in logistic regression: since collinearity is a relationship among the independent variables, the dependent variable doesn't matter, and the same diagnostics used with OLS can be applied to the predictors directly.
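To see this concretely, here is a minimal sketch (using made-up predictors, not the data from the example further down): the VIF for each predictor comes from auxiliary regressions among the predictors alone, so the outcome never enters the calculation.

# Minimal sketch: the VIF for each predictor comes from regressing that
# predictor on the other predictors -- no dependent variable is involved.
set.seed(99)                                   # made-up data just for illustration
preds <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
preds$x3 <- preds$x1 + preds$x2 + rnorm(100)   # roughly collinear with x1 + x2

vif_manual <- function(dat) {
  sapply(names(dat), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(dat), v), response = v),
                     data = dat))$r.squared
    1 / (1 - r2)                               # VIF = 1 / (1 - R^2 of the auxiliary regression)
  })
}

vif_manual(preds)                              # same answer whatever outcome you model later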

What is problematic is figuring out how much collinearity is too much for logistic regression. David Belsley did extensive work with condition indexes. He found that indexes over 30, with substantial variance accounted for by more than one variable, indicated collinearity that would cause severe problems in OLS regression. However, "severe" is always a judgment call. Perhaps the easiest way to see the problems of collinearity is to show that small changes in the data make big changes in the results.
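For reference, the condition indexes themselves only require the singular values of the column-scaled design matrix; here is a rough sketch, leaving out the variance-decomposition table that Belsley pairs with them.

# Rough sketch of condition indexes a la Belsley: scale the design matrix
# (including the intercept) to unit column length and take singular-value ratios.
cond_index <- function(X) {
  X <- cbind(intercept = 1, as.matrix(X))
  X <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))
  d <- svd(X)$d
  max(d) / d          # indexes over about 30 are the usual warning sign
}
# e.g. cond_index(data.frame(x1, x2, x3)) once the predictors below have been simulated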

This paper (http://www.medicine.mcgill.ca/epidemiology/joseph/courses/epib-621/logconfound.pdf) offers examples of collinearity in logistic regression. It even shows that R detects exact collinearity, and, in fact, some cases of approximate collinearity will cause the same warning:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
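(As an aside, exact collinearity is easier still to spot: glm() drops the aliased column and returns an NA coefficient, as in this small made-up example.)

# Made-up example of exact collinearity: z3 is exactly z1 + z2
set.seed(1)
z1 <- rnorm(100)
z2 <- rnorm(100)
z3 <- z1 + z2
zy <- rbinom(100, 1, plogis(z1 - z2))
coef(glm(zy ~ z1 + z2 + z3, family = binomial))   # the z3 coefficient comes back NA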

Nevertheless, we can ignore the fitted-probabilities warning and run:

set.seed(1234)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2 + rnorm(100, 0, 1)    # x3 is approximately collinear with x1 + x2

y <- x1 + 2*x2 + 3*x3 + rnorm(100)
ylog <- cut(y, 2, c(1, 0))          # dichotomize y into a binary factor

m1 <- glm(ylog ~ x1 + x2 + x3, family = binomial)
coef(m1)

which yields coefficients (intercept, x1, x2, x3) of -2.55, 1.97, 5.60 and 12.54.

We can then slightly perturb x1 and x2, add them for a new x3 and run again:

x1a <- x1 + rnorm(100, 0, .01)      # tiny perturbation of x1
x2a <- x2 + rnorm(100, 0, .01)      # tiny perturbation of x2
x3a <- x1a + x2a + rnorm(100, 0, 1)

ya <- x1a + 2*x2a + 3*x3a + rnorm(100)
yloga <- cut(ya, 2, c(1, 0))

# note: the original response ylog is reused here, so only the predictors change
m2 <- glm(ylog ~ x1a + x2a + x3a, family = binomial)
coef(m2)

This yields wildly different coefficients: 0.003, 3.012, 3.51 and -0.41.
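A quick side-by-side comparison of the two fits (simply combining the estimates already produced above) makes the instability plain:

# Put the two sets of estimates next to each other
round(cbind(original = coef(m1), perturbed = coef(m2)), 3)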

And yet, this set of independent variables does not have a high condition index:

library(perturb)    # Belsley-style collinearity diagnostics
colldiag(m1)        # condition indexes and variance-decomposition proportions

reports a maximum condition index of only 3.54.

I am not aware of any Monte Carlo studies of this; if there are none, it seems like a good area for research.
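If someone wanted to try it, such a Monte Carlo study might look roughly like the sketch below (all names and settings are made up for illustration): simulate predictors with a controllable degree of collinearity, fit the logistic model many times, and relate the spread of the estimates to the condition index of each simulated design.

# Hypothetical Monte Carlo sketch: how does coefficient instability relate to
# the degree of collinearity? (Strong collinearity may also trigger the
# "fitted probabilities numerically 0 or 1" warning seen above.)
set.seed(2024)
one_run <- function(noise_sd) {
  x1 <- rnorm(100)
  x2 <- rnorm(100)
  x3 <- x1 + x2 + rnorm(100, 0, noise_sd)   # smaller noise_sd = stronger collinearity
  y  <- rbinom(100, 1, plogis(x1 + 2*x2 + 3*x3))
  fit <- glm(y ~ x1 + x2 + x3, family = binomial)
  X   <- scale(model.matrix(fit), center = FALSE,
               scale = sqrt(colSums(model.matrix(fit)^2)))
  d   <- svd(X)$d
  c(b3 = unname(coef(fit)["x3"]), cond = max(d) / min(d))
}

weak   <- replicate(200, one_run(noise_sd = 1))     # mild collinearity
strong <- replicate(200, one_run(noise_sd = 0.05))  # near-exact collinearity

sd(weak["b3", ])      # spread of the x3 coefficient under mild collinearity
mean(weak["cond", ])  # typical condition index under mild collinearity
sd(strong["b3", ])
mean(strong["cond", ])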