Solved – Feature selection & Cross Validation

cross-validation, feature selection, lasso, machine learning

This is a popular topic here, but I have been reading through the different pages and could not find anything related to what I am wondering about now.

So, I have a data set with X features, and I would like to obtain the subset of them that is related to the output variable (i.e., feature selection).
Let's say I want to use a Lasso regression model, which performs feature selection via L1 regularization.

My question is: how should I apply CV here?

Should I count the number of times a specific variable is discarded across the folds? At the end I would know which of them were discarded most often…
But I do not trust this approach, since it ignores the variance of my results. I guess I should apply some hypothesis test to have some certainty about my results, but which one should I apply?
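For concreteness, the "count the discards" idea from the question could be sketched like this (a hypothetical illustration on synthetic data; the variable names and the use of scikit-learn are assumptions, not from the original post):

```python
# Count, per feature, in how many CV folds Lasso shrinks its coefficient to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Synthetic data: 10 features, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

discard_counts = np.zeros(X.shape[1], dtype=int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in kf.split(X):
    model = Lasso(alpha=1.0).fit(X[train_idx], y[train_idx])
    # A coefficient shrunk exactly to zero means the feature was discarded
    discard_counts += (model.coef_ == 0).astype(int)

# discard_counts[i] = number of folds in which feature i was dropped
```

As the answer below explains, this tally alone does not give you an honest performance estimate, which is exactly the asker's worry.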

Thank you all!

Best Answer

This is the general problem of selecting the hyperparameters of a model using CV; feature selection is just a specific instance of it. The same question arises when selecting the regularization coefficient, the tree depth in decision trees, the number of layers in a neural network, or any other hyperparameter.

Basically, the problem is that you cannot use the same CV procedure both to select your hyperparameter and to test the performance of the selected value, since you would be overfitting to the CV splits.

The best solution is to perform nested CV: the inner CV simulates the hyperparameter-selection process on each training fold, while the outer CV tests the performance of the model (after the inner parameter selection) on each test fold.
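A minimal sketch of that nested-CV setup, assuming scikit-learn and synthetic data (neither is named in the original answer): `LassoCV` performs the inner hyperparameter search on whatever data it is fit on, and `cross_val_score` wraps it in the outer evaluation loop.

```python
# Nested CV: inner loop picks the Lasso regularization strength,
# outer loop estimates generalization performance of the whole procedure.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # selects alpha
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)  # scores the model

# LassoCV re-runs the inner alpha search independently on each outer
# training fold, so the outer scores are not biased by the selection.
model = LassoCV(cv=inner_cv, random_state=0)
outer_scores = cross_val_score(model, X, y, cv=outer_cv, scoring="r2")

# outer_scores: one generalization estimate per outer test fold
```

Note that the outer scores evaluate the *procedure* (Lasso plus inner selection), not any single fixed feature subset.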

This process provides a properly generalized CV score, regardless of which values of the hyperparameter (or, in your case, which features) were selected internally in each training fold.

How do you eventually choose the features you will use? Now that you have a proper, non-overfitted estimate of the model's performance, you perform the same selection process on the entire dataset. Just run the same feature-selection procedure you used in each inner CV (on each training fold) on the entire dataset, and you have your final feature selection.
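That final step can be sketched as follows (again assuming scikit-learn and the same synthetic data; the variable names are illustrative):

```python
# Final model: rerun the same selection procedure on ALL the data,
# then read off which features Lasso kept (nonzero coefficients).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

final_model = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(final_model.coef_)  # indices of kept features
```

The nested-CV score from before is the honest performance estimate to report for this final feature set.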

This is a well-known topic; you can find more here (and in many other references): nested CV answer
