Logistic Regression – Multicollinearity Concerns and Pitfalls

Tags: logistic, multicollinearity, regression

In Logistic Regression, is there a need to be as concerned about multicollinearity as you would be in straight up OLS regression?

For example, with a logistic regression where multicollinearity exists, would you need to be cautious (as you would in OLS regression) when drawing inferences from the beta coefficients?

For OLS regression, one "fix" for high multicollinearity is ridge regression. Is there something like that for logistic regression? There are also options like dropping variables or combining variables.

What approaches are reasonable for reducing the effects of multicollinearity in a logistic regression? Are they essentially the same as in OLS?

(Note: this is not for the purpose of a designed experiment)

Best Answer

All of the same principles concerning multicollinearity apply to logistic regression as they do to OLS. The same diagnostics for assessing multicollinearity can be used (e.g., VIF, condition number, auxiliary regressions), and the same dimension reduction techniques can be used (such as combining variables via principal components analysis).
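For instance, here is a minimal sketch of those diagnostics in Python, assuming statsmodels and pandas are available; the simulated data and variable names are purely illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Include an intercept so each VIF comes from an auxiliary regression with a constant
Xc = add_constant(X)
vifs = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vifs)                                  # x1 and x2 should show large VIFs

# Condition number of the column-scaled design matrix, another standard diagnostic
print(np.linalg.cond((X / X.std()).values))
```

Commonly cited rules of thumb flag VIFs above roughly 5–10, or condition numbers above roughly 30, as cause for concern.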

This answer by chl will lead you to some resources and R packages for fitting penalized logistic models (as well as a good discussion of these types of penalized regression procedures). But some of your comments about "solutions" to multicollinearity are a bit disconcerting to me. If you only care about estimating relationships for variables that are not collinear, these "solutions" may be fine, but if you're interested in estimating coefficients of variables that are collinear, these techniques do not solve your problem. Although the problem of multicollinearity is technical, in that your matrix of predictor variables cannot be inverted, it has a logical analog in that your predictors are not independent, and so their effects cannot be uniquely identified.
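As a concrete illustration of the penalized approach, here is a minimal sketch of ridge (L2) penalized logistic regression using scikit-learn; the simulated data, the choice of `C` (the inverse penalty strength), and `penalty=None` (which needs scikit-learn 1.2 or later) are assumptions made for the example, not part of the answer above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)      # highly collinear predictors
y = rng.binomial(1, 1 / (1 + np.exp(-(x1 + x2))))
X = np.column_stack([x1, x2])

# Standardize first so the L2 penalty shrinks all coefficients on the same scale
Xs = StandardScaler().fit_transform(X)

unpenalized = LogisticRegression(penalty=None, max_iter=1000).fit(Xs, y)
ridge = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(Xs, y)

print("unpenalized:", unpenalized.coef_)     # unstable under collinearity
print("ridge:      ", ridge.coef_)           # shrunken, more stable estimates
```

Note that this illustrates the caveat above: the penalty stabilizes the coefficient estimates under collinearity, but it does not make the collinear effects separately identifiable; the shrinkage simply picks one of many near-equivalent solutions.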
