Solved – Stepwise versus L2 regularized logistic regression: dataset-specific performance

feature selection, logistic, ridge regression, stepwise regression

I have two data sets from different collections; the second one is smaller. Both were analyzed with the same methods to derive feature sets of 10-30 features each, and each feature set was produced the same way for both data sets.

I then ran many logistic regressions, fitting both data sets with all feature sets. All of the experiments were repeated with both L2 regularized and stepwise logistic regression. The observation is that the best fit on the first data set was always obtained with stepwise logistic regression, while on the second it was always obtained with L2 regularized logistic regression. This is quite consistent, i.e. it holds for all 15 experiments on each data set.
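
For concreteness, here is a minimal sketch of this setup on synthetic data (the real feature sets are not shown). scikit-learn's `SequentialFeatureSelector` is used as a stand-in for classical p-value-based stepwise selection, which scikit-learn does not implement:

```python
# Minimal sketch of the setup above, on synthetic data.
# SequentialFeatureSelector stands in for classical p-value-based
# stepwise selection, which scikit-learn does not implement.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# L2 regularized logistic regression on the full feature set.
l2_model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

# Forward stepwise selection of 10 features, then a near-unpenalized refit
# (C=1e6 effectively switches off the default ridge penalty).
stepwise = make_pipeline(
    SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                              n_features_to_select=10),
    LogisticRegression(C=1e6, max_iter=1000),
).fit(X, y)

print("L2 in-sample accuracy:      ", l2_model.score(X, y))
print("stepwise in-sample accuracy:", stepwise.score(X, y))
```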

Why did each method perform better on one data set but not the other? Could this be due to particular characteristics of each data set?

For example, I know that L2 regularization deals better with multicollinearity and with a low observations-to-variables ratio. Can I assume that L2 performed better on the second data set because it had multicollinearity? Also, L2 does not zero out any coefficients, whereas stepwise selection does. Can I say that stepwise did better on the first data set because it may have had some overly noisy features that needed to be zeroed out? (A multicollinearity check is sketched below.)
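
Rather than assuming, the multicollinearity hypothesis can be checked directly on each data set. A minimal sketch using variance inflation factors, on hypothetical data standing in for one feature set:

```python
# Sketch: variance inflation factors (VIF) as a direct multicollinearity
# check; a common rule of thumb flags VIF above 5-10. X is a hypothetical
# (n_samples, n_features) matrix standing in for one feature set.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=200)   # inject collinearity

Xc = sm.add_constant(X)                            # VIF needs an intercept
for j in range(1, Xc.shape[1]):
    print(f"feature {j - 1}: VIF = {variance_inflation_factor(Xc, j):.1f}")
```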

Best Answer

What you observed follows from statistical theory and is completely expected. Stepwise selection uses the data themselves to choose the features, so its apparent (in-sample) fit is optimistic; the "best fitting on the first dataset" is a by-product of overfitting and is not really interesting.
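
One way to see this is to compare the apparent fit with a cross-validated estimate in which the stepwise selection is re-run inside every fold. A minimal sketch on synthetic data:

```python
# Sketch: apparent (in-sample) fit vs. cross-validated fit, on synthetic
# data. The selector sits inside the pipeline, so each CV fold re-runs the
# stepwise selection; the apparent score is typically optimistic because
# the same data both chose the features and scored the fit.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

model = make_pipeline(
    SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                              n_features_to_select=10),
    LogisticRegression(max_iter=1000),
)

apparent = model.fit(X, y).score(X, y)
validated = cross_val_score(model, X, y, cv=5).mean()
print(f"apparent accuracy: {apparent:.3f}, cross-validated: {validated:.3f}")
```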

Why do you need to reduce the number of features to 10-30 in the first place? Why is parsimony good? Why not just fit an L2 penalized model with all the features?
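
For example, a minimal sketch of that alternative on synthetic data, with the penalty strength chosen by cross-validation:

```python
# Sketch: L2 penalized logistic regression on ALL features, with the
# penalty strength C chosen by cross-validation (synthetic X, y).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

model = make_pipeline(
    StandardScaler(),   # penalized coefficients assume comparable scales
    LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=1000),
).fit(X, y)

print("chosen C:", model[-1].C_)
```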