Solved – How to detect multicollinearity in a logistic regression where all the independent variables are categorical and binary

binary data, categorical data, logistic, multicollinearity

I'm fitting a multivariable logistic regression in which all of my independent variables are categorical and binary. I have converted the categorical variables into dummies so that each one has a reference group and I can interpret the odds ratios. However, I would like to check whether there are any multicollinearity issues.

I plan to calculate the Spearman correlation between my independent variables and to calculate the VIF as well. However, I have two questions: Is the VIF appropriate in this case? Should I calculate the VIF and the Spearman correlation on my categorical variables or on their associated dummies?

I have tried other ways of detecting multicollinearity (checking whether the coefficients change much when I increase the sample size or when I add or drop variables), and I'm fairly sure there is no collinearity issue, but I would like to have "quantitative" evidence.

Best Answer

Daryl Pregibon has written a key paper on LR diagnostics. It looks like it's only available from the publisher:

Daryl Pregibon. Logistic regression diagnostics. The Annals of Statistics, volume 9, pages 705–724, 1981

That said, from a practitioner's point of view, I would be comfortable using the metrics that have been developed for OLS regression, such as the VIF or the eigenvalue-based collinearity index. The best source for this class of diagnostics is Belsley, Kuh and Welsch's book Regression Diagnostics: Identifying Influential Data and Sources of Collinearity:

http://www.amazon.com/Regression-Diagnostics-Identifying-Influential-Collinearity/dp/0471691178/ref=sr_1_sc_1?ie=UTF8&qid=1457527579&sr=8-1-spell&keywords=regression+diagnositcs+belsey

Of course, stringent purists or a PhD dissertation committee would likely object to leveraging these readily available and easily implemented OLS-based methods, but they stand as useful proxies for diagnostics built specifically for logistic regression.
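As a rough illustration of the two OLS-style diagnostics mentioned above, here is a minimal Python sketch using only NumPy. Everything in it is an assumption made for the example: the helper names vif and condition_indices are hypothetical, the data are simulated 0/1 dummies (a, b, c), and c is deliberately built as a near copy of a so that the collinearity shows up. The VIFs are taken from the diagonal of the inverse correlation matrix of the predictors, and the condition indices follow the usual recipe of scaling each column of the design matrix (intercept included) to unit length before taking singular values. Neither quantity uses the outcome, so they are computed on the dummy-coded predictors alone.

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of X (no intercept column).

    Uses the identity VIF_j = [R^{-1}]_jj, where R is the correlation
    matrix of the predictors. This matches regressing each predictor on
    the others with an intercept and computing 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def condition_indices(X):
    """Condition indices of the design matrix, intercept included.

    Each column is scaled to unit length first; the indices are the
    largest singular value divided by each singular value.
    """
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    scaled = design / np.linalg.norm(design, axis=0)
    sv = np.linalg.svd(scaled, compute_uv=False)
    return sv.max() / sv

# Simulated binary dummies (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n = 500
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)      # generated independently of a
c = a.copy()
c[:10] = 1 - c[:10]            # c equals a except in the first 10 rows

X = np.column_stack([a, b, c])
vifs = vif(X)
cond = condition_indices(X)

for name, value in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {value:.2f}")
print("condition indices:", np.round(cond, 2))
```

In this toy setup the VIFs for a and c should come out clearly inflated while the one for b stays close to 1, and the largest condition index should stand well apart from the rest. Common rules of thumb (a VIF above roughly 5 to 10, or a condition index above roughly 30) are conventions rather than hard thresholds, so treat the numbers as a screening aid.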