Logistic Regression – How to Cross-Validate Stepwise Logistic Regression?

cross-validation, feature selection, logistic

I have a conceptual problem understanding how to cross-validate stepwise logistic regression. Every time the training set is divided, it is very likely that different features are chosen based on the penter and premove criteria. Should I cross-validate using a different chosen model every time, or should I find a ground truth and proceed with cross-validating over that? I think the latter sounds more reasonable, but I fear that somewhere I'm compromising the test blindness.
Help is appreciated.

Best Answer

The Elements of Statistical Learning puts the answer quite clearly (second edition, p. 246):

In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied. There is one qualification: initial unsupervised screening steps can be done before samples are left out.

In this type of analysis the problem is that the "ground truth" deduced from your sample might not represent the "ground truth" in the population. Cross-validation can help with generalizing results to the population, but only if all steps of the modeling procedure are repeated for each fold of validation.
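To make the difference concrete, here is a minimal sketch in plain Python. It is not stepwise logistic regression: as a stand-in it uses a simple supervised filter (keep the k features with the largest difference in class means) and a nearest-centroid classifier, and all function names are hypothetical. The features are pure noise, so the true accuracy of any model is 50%. The `select_inside` flag switches between redoing the selection within every training fold and selecting once on the full data before cross-validating.

```python
import random

def make_noise_data(n=40, p=200, seed=0):
    """Features are pure noise; labels are independent of all of them."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]  # balanced 0/1 labels
    return X, y

def select_features(X, y, k):
    """Supervised selection (stand-in for stepwise selection):
    keep the k features with the largest |difference in class means|."""
    n1 = sum(y)
    n0 = len(y) - n1
    scores = []
    for j in range(len(X[0])):
        m1 = sum(row[j] for row, t in zip(X, y) if t == 1) / n1
        m0 = sum(row[j] for row, t in zip(X, y) if t == 0) / n0
        scores.append((abs(m1 - m0), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def fit_centroids(X, y, feats):
    """Class means on the selected features."""
    cents = {}
    for c in (0, 1):
        rows = [r for r, t in zip(X, y) if t == c]
        cents[c] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    return cents

def predict(cents, row, feats):
    """Assign the class whose centroid is closest (ties go to class 0)."""
    def dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(feats, cents[c]))
    return 0 if dist(0) <= dist(1) else 1

def cv_accuracy(X, y, k, n_folds=5, select_inside=True):
    """K-fold CV. With select_inside=True the selection step is repeated
    on each training fold; otherwise it is done once on all the data."""
    n = len(y)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    global_feats = None if select_inside else select_features(X, y, k)
    correct, chosen = 0, []
    for test_idx in folds:
        held_out = set(test_idx)
        X_tr = [X[i] for i in range(n) if i not in held_out]
        y_tr = [y[i] for i in range(n) if i not in held_out]
        feats = select_features(X_tr, y_tr, k) if select_inside else global_feats
        chosen.append(sorted(feats))
        cents = fit_centroids(X_tr, y_tr, feats)
        correct += sum(predict(cents, X[i], feats) == y[i] for i in test_idx)
    return correct / n, chosen

X, y = make_noise_data()
honest_acc, honest_feats = cv_accuracy(X, y, k=5, select_inside=True)
leaky_acc, leaky_feats = cv_accuracy(X, y, k=5, select_inside=False)
print("selection inside each fold:", honest_acc, honest_feats)
print("selection before CV:       ", leaky_acc)
```

With selection repeated inside each fold, the chosen feature sets usually differ from fold to fold, and that is expected: cross-validation is then estimating the performance of the whole procedure (selection plus fitting), not of one particular feature set. With selection done once on all the data, the held-out samples have already influenced which features were kept, so the estimate typically comes out well above 50% even though the features carry no signal.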

As both I and @user777 recommend, you will probably be better off if you use a method other than stepwise selection to deal with your correlated predictor variables. With highly correlated predictors, stepwise selection will almost certainly lead to highly varying choices of predictors from fold to fold. Regularization methods deal with correlated predictors much better. Ridge regression, for example, is essentially a principal-components regression with weights on the components, so that highly correlated variables tend to show up together in the same components.
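The link between ridge and principal components can be written out for the ordinary linear-regression case (for logistic regression the same picture holds only approximately, within each reweighted least-squares step). Assuming a centered design matrix with singular value decomposition $X = U D V^\top$, singular values $d_1 \ge d_2 \ge \dots \ge d_p$, and ridge penalty $\lambda \ge 0$:

```latex
\begin{align*}
X\hat\beta^{\text{ls}}    &= \sum_{j=1}^{p} u_j \, u_j^\top y, \\
X\hat\beta^{\text{ridge}} &= \sum_{j=1}^{p} u_j \, \frac{d_j^2}{d_j^2 + \lambda} \, u_j^\top y, \\
X\hat\beta^{\text{pcr}}   &= \sum_{j=1}^{M} u_j \, u_j^\top y .
\end{align*}
```

Here $u_j$ is the $j$-th column of $U$, i.e. the $j$-th principal component direction of $X$ scaled to unit length. Principal-components regression keeps the first $M$ components with weight 1 and drops the rest; ridge keeps all of them but shrinks each by $d_j^2/(d_j^2+\lambda)$, so low-variance directions (which is where the differences between highly correlated predictors live) are shrunk the most.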
