Machine Learning – How to Use Nested Cross-Validation for Feature Selection

cross-validation, feature selection, machine learning, model selection

The interplay between a model's complexity and its ability to fit a given dataset is crucial for model selection.

This thread discusses the issue in the context of model selection, and the same answer is suggested to apply to preprocessing steps.

Now, considering feature selection as a preprocessing step, we can imagine using nested CV where the inner loop employs some feature selection scheme. However, this paper discourages the use of "double cross-validation" in this way. Admittedly, double cross-validation is different from nested CV, but the argument put forward by the authors seems to extend to nested CV.

For example, if we use a variable selection method together with k-nearest neighbours, then both the number of selected variables and the number of neighbours, k, directly affect model complexity. Therefore, in step 1 of the external loop we might choose a different k for different L’ and, for a fixed number of variables, end up averaging over models with different model complexities.

Basically, we cannot make a decision about which features to use within the nested CV scheme. However, I feel that the resulting errors can still be used as a valid generalization error estimate, stating that a model built using this particular procedure of feature selection and parameter estimation is expected to have such and such error on new data.
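To make the setup concrete, here is a minimal sketch in Python with scikit-learn. The dataset, the SelectKBest/k-NN pipeline, and the grids are my own illustrative assumptions, not choices prescribed by the question or the paper; the point is only to show feature selection living inside the inner loop.

```python
# Minimal sketch of nested CV with feature selection in the inner loop.
# Dataset, pipeline, and grids below are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: jointly tunes the number of selected variables and the
# number of neighbours k -- both of which drive model complexity.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("knn", KNeighborsClassifier()),
])
param_grid = {"select__k": [5, 10, 20], "knn__n_neighbors": [1, 5, 15]}
search = GridSearchCV(pipe, param_grid,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Outer loop: each fold may settle on a different feature subset and a
# different k, so the scores estimate the error of the *procedure*,
# not of any single model.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for train_idx, test_idx in outer_cv.split(X):
    search.fit(X[train_idx], y[train_idx])
    scores.append(search.score(X[test_idx], y[test_idx]))
    print("fold choices:", search.best_params_)  # often differ per fold

print("estimated generalization accuracy: %.3f" % np.mean(scores))
```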

If this makes sense, then the same argument can be extended to other choices, such as the learning algorithm itself.

The question is: does this make sense?

Best Answer

Nested cross-validation can indeed include feature selection; it is similar to preprocessing, as you mentioned.

I feel that the resulting errors can still be used as a valid generalization error estimate, stating that a model built using this particular procedure of feature selection and parameter estimation is expected to have such and such error on new data.

This is exactly the idea: nested CV does not produce a single model, it estimates the error of the whole model-building procedure. So it makes sense, and it can be extended to other choices. Flat cross-validation can be preferred, though, when we also want to obtain a final model.
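For contrast, a flat (non-nested) sketch, continuing the code above under the same illustrative assumptions: fitting the search once on all the data yields one deployable model, while the nested scores computed earlier remain the honest error estimate for it.

```python
# Flat (non-nested) CV, reusing `search`, `X`, `y` from the sketch above:
# fit the search once on all data and keep the refitted winner as the
# final model. Its own internal CV score is optimistically biased.
search.fit(X, y)
final_model = search.best_estimator_
print("final choices:", search.best_params_)
```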

I understand the concerns of the authors here:

... for a fixed number of variables end up averaging over models with different model complexities

But this actually aims to answer the question: "What can we do with this set of features?" Averaging over folds with different complexities is acceptable precisely because we are evaluating the procedure, not one particular model.