Solved – How to avoid collinearity of categorical variables in logistic regression

logistic, multicollinearity, multiple regression, regression

I have the following problem: I'm performing a multiple logistic regression on several variables, each of which has a nominal scale. I want to avoid multicollinearity in my regression. If the variables were continuous, I could compute the variance inflation factor (VIF) and look for variables with a high VIF. If the variables were ordinally scaled, I could compute Spearman's rank correlation coefficient for each pair of variables and compare the computed values with a certain threshold. But what do I do if the variables are only nominally scaled? One idea would be to perform a pairwise chi-square test for independence, but the variables don't all have the same co-domains, so that would be another problem. Is there a way to solve this?
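For what it's worth, the pairwise chi-square idea can be sketched as follows with `scipy.stats.chi2_contingency`, which accepts contingency tables of any shape, so differing co-domains only change the degrees of freedom while the p-values stay on a common scale. The variable names and data here are invented for illustration:

```python
# Sketch: pairwise chi-square test of independence between two nominal
# variables with different co-domains (3 levels vs. 2 levels).
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "a": rng.choice(["x", "y", "z"], size=300),  # 3-level nominal variable
    "b": rng.choice(["u", "v"], size=300),       # 2-level nominal variable
})

# Build the contingency table and test for independence
table = pd.crosstab(df["a"], df["b"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}, dof={dof}")
```

Note that the raw chi-square statistics themselves are not comparable across tables of different sizes; only the p-values (or a normalised effect size) are.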

Best Answer

I would second @EdM's comment (+1) and suggest using a regularised regression approach.

I think an elastic-net/ridge regression approach should allow you to deal with collinear predictors. Just be careful to normalise your feature matrix $X$ appropriately before fitting, otherwise you risk regularising features disproportionately (yes, I mean the $0/1$ dummy columns: scale each column to mean $0$ and unit variance).
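A minimal sketch of this in scikit-learn, assuming your nominal predictors are one-hot encoded first (the column names and data are made up for illustration):

```python
# Sketch: elastic-net logistic regression on one-hot-encoded nominal
# predictors, with each 0/1 dummy column scaled to mean 0, unit variance.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "colour": rng.choice(["red", "green", "blue"], size=200),
    "region": rng.choice(["north", "south"], size=200),
})
y = rng.integers(0, 2, size=200)

# One-hot encode the nominal variables, then standardise each column
X = pd.get_dummies(df)
X_scaled = StandardScaler().fit_transform(X)

# "saga" is the sklearn solver that supports the elastic-net penalty;
# l1_ratio blends the lasso (1.0) and ridge (0.0) components
model = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0, max_iter=5000
)
model.fit(X_scaled, y)
print(dict(zip(X.columns, model.coef_[0].round(3))))
```

The `l1_ratio` and `C` values here are placeholders; in practice you would choose them by cross-validation (e.g. with `LogisticRegressionCV`).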

Clearly you would have to cross-validate your results to ensure some notion of stability. Let me also note that instability is not a huge problem in itself: it suggests that there is no obvious solution/inferential result, and that simply interpreting a single GLM fit as "ground truth" would be incoherent.