Solved – Forward search feature selection and cross-validation

cross-validation, feature selection, machine learning

I have a question regarding forward search for feature selection. Basically, I've found here and here that the procedure is the following:

Forward Search Procedure

As the procedure suggests, cross-validation is applied repeatedly as the feature set grows, and at the end we select the set with the best cross-validation performance.
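For concreteness, here is a minimal sketch of that first procedure; the scikit-learn estimator, the synthetic dataset, and the fold count are placeholders of my own, not from the linked posts. Cross-validation itself is the selection criterion at every step, and the final set is the one with the best CV score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data and estimator
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

selected = []                        # features chosen so far
remaining = list(range(X.shape[1]))
best_sets = []                       # (cv_score, feature_set) for each size

while remaining:
    # Try adding each remaining feature; keep the one with the best CV score
    scores = []
    for f in remaining:
        candidate = selected + [f]
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, candidate], y, cv=5).mean()
        scores.append((score, f))
    best_score, best_f = max(scores)
    selected.append(best_f)
    remaining.remove(best_f)
    best_sets.append((best_score, list(selected)))

# Final feature set: the size with the best cross-validated performance
best_score, best_features = max(best_sets)
print(best_features, best_score)
```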

However, in another question within the forum, it is mentioned that feature selection should be applied separately:

  1. Select one fold as the test set

  2. On the remaining folds perform feature selection

  3. Apply the machine learning algorithm to the remaining samples using the selected features

  4. Test whether the test set is correctly classified

  5. Go to 1.

Therefore we will end up with several best feature sets (one for each fold).
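For comparison, here is a rough sketch of that second procedure, with scikit-learn's SequentialFeatureSelector standing in for "some feature-selection algorithm" and the same placeholder data as above; the point is that selection happens inside each training split, so every fold can return a different feature set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Placeholder data; the number of features to select (3) is also arbitrary
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

per_fold_sets, per_fold_acc = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Steps 1-2: hold out one fold, run feature selection on the rest only
    sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                    n_features_to_select=3, direction="forward")
    sfs.fit(X[train_idx], y[train_idx])
    features = np.flatnonzero(sfs.get_support())

    # Steps 3-4: fit on the training folds, score on the held-out fold
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx][:, features], y[train_idx])
    per_fold_acc.append(clf.score(X[test_idx][:, features], y[test_idx]))
    per_fold_sets.append(features)   # one "best" feature set per fold

print(per_fold_sets)                 # typically several different sets
print(np.mean(per_fold_acc))         # estimate of the whole pipeline's performance
```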

I'm confused about which procedure is the better one to follow. And if it is the second one, how do we find a unique best feature set?

Best Answer

Your second procedure assumes you have some other feature selection algorithm (for example, stepwise regression with some stopping rule), distinct from the cross-validation. If you don't have this, you'll just have to use the first procedure (where cross-validation is the whole feature-selection algorithm).

Also, even if the second procedure is applicable, the first procedure might do better. In the second procedure, a greedy feature-selection algorithm might always pick models that are overfit to the training data. Then the CV would only let you choose among these bad models. This shouldn't happen in the first procedure.

On the other hand, if your problem does have a specialized feature-selection algorithm which is computationally-efficient, then the second procedure may run much faster than the first.

If you do use the second procedure, one way to choose a best feature set is to let CV choose the model size. At every model size, you might compare different models on each data split, but average their test errors across all splits. This way, you can use CV to decide which model size gives the best estimated performance. Finally, rerun your feature-selection algorithm on the full dataset, up to the size chosen by CV, and use this as the final feature set.
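Here is one way that size-selection idea could look in code, again using SequentialFeatureSelector as a stand-in selector on placeholder data: CV compares model sizes by averaging held-out error across splits, and then the selector is rerun on the full dataset at the chosen size.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Placeholder data and estimator
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

def cv_score_for_size(k):
    """Average held-out accuracy over all splits, selecting k features per split."""
    accs = []
    for train_idx, test_idx in folds:
        sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                        n_features_to_select=k)
        sfs.fit(X[train_idx], y[train_idx])
        feats = np.flatnonzero(sfs.get_support())
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx][:, feats], y[train_idx])
        accs.append(clf.score(X[test_idx][:, feats], y[test_idx]))
    return np.mean(accs)

# Let CV choose the model size: feature sets may differ across splits,
# only their size and average test error are compared
sizes = range(1, X.shape[1])
best_size = max(sizes, key=cv_score_for_size)

# Rerun feature selection on the full dataset at the chosen size
final = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                  n_features_to_select=best_size).fit(X, y)
print(best_size, np.flatnonzero(final.get_support()))
```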
