Solved – Variable selection for logistic regression with separated data

feature-selection, logistic, r, separation

I have a fairly large dataset ($\approx 3$ million observations and a dozen candidate predictors) on which I would like to perform a logistic regression.
The dataset suffers from separation, so the usual maximum-likelihood fit cannot converge. That is why I am using Firth penalization (the logistf package for R) to fit my model.
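For reference, here is a minimal sketch of the kind of fit I am running (the outcome `y`, the predictors `x1`–`x3`, and the data frame `df` are placeholders):

```r
library(logistf)  # Firth-penalized logistic regression

# Firth's penalized likelihood keeps the estimates finite under separation
fit <- logistf(y ~ x1 + x2 + x3, data = df)
summary(fit)
```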

I would like to select the best subset of variables for my final model, but I can't find a proper way to do that. I know that stepwise selection is out of the question, and I would usually use $L_1$- or $L_2$-penalized regression so that some coefficients are shrunk to zero.
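For illustration, this is the kind of $L_1$ selection I have in mind, sketched with the glmnet package (`X` is a placeholder numeric model matrix of the candidate predictors and `y` a 0/1 outcome):

```r
library(glmnet)

# Cross-validated lasso (alpha = 1) for logistic regression
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)

# Coefficients at the cross-validated lambda; exact zeros are dropped variables
coef(cvfit, s = "lambda.min")
```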

My problem is that the function I am using to fit my model does not support an additional penalty, so an elastic-net-Firth regression is not an option.

Is there, apart from stepwise selection, another way to select my variables?

Best Answer

In my experience with logistic regression where $n \gg p$, using $L_1$ or $L_2$ regularization on the coefficients has little to no effect on inference. Variables selected by the $L_1$ penalty will usually have enormous $z$-scores, suggesting they would remain in any final model anyway. Furthermore, post-selection inference, i.e. the interpretation of $p$-values after performing variable selection, is a difficult and active research field. Since you seem to have ample data, regularization only complicates things.

Instead, standard logistic regression with careful consideration of domain knowledge, collinearity, variable transformations, and quadratic/interaction terms will go a very long way. And since you have so much data, data-splitting techniques can give you further confidence in this (labor-intensive) process.
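As a rough sketch of the data-splitting idea, assuming a data frame `df` and a placeholder model formula:

```r
set.seed(1)  # reproducible split

# Hold out half the data for validation; df and the formula are placeholders
idx   <- sample(nrow(df), size = floor(nrow(df) / 2))
train <- df[idx, ]
valid <- df[-idx, ]

# Develop the specification (transformations, interactions) on the training half
fit <- glm(y ~ x1 + x2 + x1:x2 + I(x3^2), data = train, family = binomial)

# Refit the frozen specification on the held-out half and compare
fit_valid <- glm(formula(fit), data = valid, family = binomial)
cbind(train = coef(fit), validation = coef(fit_valid))
```

If the two fits tell a similar story, that is some reassurance that the specification is not an artifact of one half of the data.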