Solved – How to perform feature selection and hyperparameter optimization in cross validation

cross-validation, feature selection, hyperparameter, machine learning

Note: I have read a lot of the questions already posted on this topic, but I still have some confusion.

I want to perform feature selection and model selection for multiple models, e.g. random forest (RF), support vector machine (SVM), and lasso regression. There seem to be a few ways to do feature selection (FS) or hyperparameter optimization (HPO) through cross-validation (CV). My data set has n ≈ 700 (sample size) and p = 272 (number of features); adding another set of features could increase p to ~20272.

My current plan is the following (a code sketch of this loop follows the list):

  • Run a resampling method (k-fold or Monte Carlo) to get different splits of pseudo-training and pseudo-test data.
  • In each iteration of resampling:
    • Run feature selection on the pseudo-training data.
    • Increment counts for which top variables are selected.
    • Train the model using those features on the pseudo-training data.
    • Estimate how well it does by testing on the pseudo-test data.
  • Select the feature set by taking the top k most frequently selected variables after all iterations of resampling.
  • Using the selected feature set, run HPO for all models of interest in the same manner as above: estimate the error of models with different hyperparameters trained on pseudo-training data and tested on pseudo-test data, then take the hyperparameters that give the lowest (pseudo) test error.
  • With the selected feature set and optimal hyperparameters in hand, build the final models on the full training data and report the error on the test set.
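For concreteness, here is a minimal sketch of the resampling loop described above, assuming scikit-learn; the Monte Carlo splitter, the univariate filter, the number of kept features, and the random forest are illustrative choices, not specifics from my plan.

```python
# Sketch of the resampling loop described above (scikit-learn assumed;
# the filter, model, and split sizes are illustrative choices).
import numpy as np
from collections import Counter
from sklearn.model_selection import ShuffleSplit
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def resample_fs_and_score(X, y, n_splits=50, k_features=20, random_state=0):
    selection_counts = Counter()
    scores = []
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.2,
                            random_state=random_state)  # Monte Carlo CV
    for train_idx, test_idx in splitter.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]

        # Feature selection on the pseudo-training data only
        selector = SelectKBest(f_classif, k=k_features).fit(X_tr, y_tr)
        chosen = np.flatnonzero(selector.get_support())
        selection_counts.update(chosen)

        # Train on the selected features, evaluate on the pseudo-test data
        model = RandomForestClassifier(random_state=random_state)
        model.fit(X_tr[:, chosen], y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te[:, chosen])))

    return selection_counts, np.mean(scores)
```

The selection counts would then be used to pick the top k variables, and the same loop (with the feature set fixed) would be reused for the HPO step.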

I have several questions:

Is it OK to do FS and HPO separately, through many resampling phases?
Also, is it recommended to run FS again on the overall training set after these rounds of CV (so the purpose of the CV would have been to verify that we select the same features most of the time)? Likewise, should we do HPO again on the overall training data (to verify that, overall, the same hyperparameters are selected), and if so, should I use something like k-fold or Monte Carlo to get those final validation errors?

My other question is: should I merge FS and HPO into the same resampling phase? If we had an initial guess at possible subsets, we could treat the subset like another hyperparameter, but in my case we cannot try all 2^p subsets, so we need some kind of initial filtering. Is it OK to do that initial filtering first? If so, should I do it on an iteration of the k-fold or Monte Carlo sampled training data? Wouldn't that selection be biased with respect to the later resampled evaluations?

Please let me know if I am being unclear or if I am doing something wrong/not recommended.

Best Answer

There is a lot in this question, but one thing you are certainly doing incorrectly is performing feature selection on the training data and then building your model using that same training data. This can lead to very optimistically biased error rates.

Without getting too involved, I would suggest treating feature selection as one and the same as hyperparameter optimization. The choice of the number of trees to use in a random forest is no different from the choice of using a random forest with variables x, y, z versus only x and y. The important thing is to do this properly using nested cross-validation, for which there are many related questions on this site.
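As a minimal sketch of what that looks like, assuming scikit-learn: put the feature selector inside a pipeline so its `k` is tuned like any other hyperparameter, and wrap the whole search in an outer cross-validation loop. The synthetic data, the univariate filter, and the SVM grid below are illustrative stand-ins, not the poster's actual setup.

```python
# Nested cross-validation with feature selection tuned as a hyperparameter
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Stand-in data with roughly the dimensions from the question
X, y = make_classification(n_samples=700, n_features=272, random_state=0)

# The selector lives inside the pipeline, so it is refit on each
# inner-fold training set and never sees the corresponding test data.
pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("model", SVC()),
])

param_grid = {
    "select__k": [10, 50, 100],   # the feature subset treated as a hyperparameter
    "model__C": [0.1, 1, 10],
}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipe, param_grid, cv=inner_cv)

# The outer loop scores the whole "select features + tune + fit" procedure,
# giving a far less optimistic estimate of generalization error.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(nested_scores.mean())
```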

If you want to use some form of feature selection other than a wrapper prior to the grid search, in order to shorten the grid search, you need either to keep it self-contained in its own cross-validation loop or simply to hold out a part of your data that is used only for feature selection.
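A minimal sketch of the hold-out option, again assuming scikit-learn; the split proportion and the univariate filter are illustrative choices.

```python
# Hold out a slice of the data that is used only for feature selection
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=700, n_features=272, random_state=0)

# Reserve a slice of the data for feature selection only
X_fs, X_rest, y_fs, y_rest = train_test_split(
    X, y, train_size=0.2, random_state=0, stratify=y
)

# The filter sees only the feature-selection slice
keep = np.flatnonzero(SelectKBest(f_classif, k=50).fit(X_fs, y_fs).get_support())

# All later (nested) cross-validation and tuning should use only the
# remaining data restricted to the kept features, so the filter never
# influences the evaluation splits.
X_model, y_model = X_rest[:, keep], y_rest
```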
