Solved – Feature selection: is nested cross-validation needed?

cross-validation, machine learning

I have about 150 samples and 1000 features (ranked by importance using the ReliefF score). My question is, what would be the best approach to:

  • choose the hyperparameters

  • choose the optimal number of features to use

  • report the accuracy of my models, SVM and kNN (I don’t intend to choose which of them is best to use, but rather to report the accuracy of each)

First approach: Cross Validation

  1. Split the data into 80% for training and 20% for final testing

  2. Using the training data, rank the features with the ReliefF score

  3. Using the training data, loop over K (the number of features, from most to least important) and the hyperparameters, using 10-fold cross-validation to compute the 10-fold misclassification rate for each combination

  4. Choose the K (number of features) and hyperparameter values that give the lowest misclassification rate

  5. Train the algorithm on the training data with the optimal parameters and test on the test data (the 20% of the initial data that was not used at all for selecting the parameters)

  6. Report accuracy (a code sketch of this procedure follows the list)
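
A minimal sketch of this first procedure, assuming Python with scikit-learn. The synthetic data, the parameter grids, and the use of SelectKBest with mutual information as a stand-in for the ReliefF ranking are all illustrative assumptions, not part of the question:

```python
# Sketch of the first approach: 80/20 hold-out, with 10-fold CV on the
# training portion to tune K (number of features) and the SVM hyperparameters.
# SelectKBest + mutual information stands in for the ReliefF ranking step;
# the grids and data are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC

# Placeholder data with roughly the dimensions in the question (150 x 1000).
X, y = make_classification(n_samples=150, n_features=1000, n_informative=20,
                           random_state=0)

# Step 1: 80/20 split; the 20% is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Steps 2-4: rank features and tune K and the hyperparameters with 10-fold CV
# on the training data only.  Because the selector sits inside the pipeline,
# the ranking is refit within every CV fold, so it cannot leak into the fold
# used for validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(mutual_info_classif)),
    ("clf", SVC()),
])
param_grid = {
    "select__k": [10, 20, 50, 100, 200],   # K = number of features
    "clf__C": [0.1, 1, 10, 100],           # illustrative SVM grid
    "clf__gamma": ["scale", 0.01, 0.001],
}
search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)

# Steps 5-6: GridSearchCV refits the best combination on all training data;
# report accuracy on the untouched 20%.
print("Best parameters:", search.best_params_)
print("Hold-out accuracy:", search.score(X_test, y_test))
```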

Second approach: Nested Cross Validation

  1. Split the data into 10 folds (external cross-validation)

  2. Within each external training set (9 of the 10 folds), do the same as above (internal cross-validation) to choose the optimal number of features K and the hyperparameters, using 10-fold cross-validation

  3. For each external fold, train on the 9/10 of the data with the chosen parameters and test on the remaining 1/10

  4. Report the average accuracy over the 10 external folds (a sketch of this nested procedure follows the list)
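
A sketch of the nested variant under the same assumptions as above (scikit-learn, SelectKBest standing in for ReliefF, illustrative grids): the inner GridSearchCV does the tuning, and cross_val_score wraps it in the external 10-fold loop:

```python
# Sketch of the second approach (nested CV): the inner 10-fold search tunes
# K and the SVM hyperparameters; the outer 10-fold loop estimates the
# generalization error of the whole tuning process.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=1000, n_informative=20,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(mutual_info_classif)),
    ("clf", SVC()),
])
param_grid = {
    "select__k": [10, 20, 50, 100, 200],
    "clf__C": [0.1, 1, 10, 100],
    "clf__gamma": ["scale", 0.01, 0.001],
}

inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)

# The "model" evaluated by the outer loop is the entire tuning process.
tuned_model = GridSearchCV(pipe, param_grid, cv=inner_cv,
                           scoring="accuracy", n_jobs=-1)
outer_scores = cross_val_score(tuned_model, X, y, cv=outer_cv,
                               scoring="accuracy")

# Report both the mean and the spread over the 10 external folds.
print("Nested CV accuracy: %.3f +/- %.3f"
      % (outer_scores.mean(), outer_scores.std()))
```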

Which one should I choose? Any suggestions?

Best Answer

The first approach is actually hold-out evaluation (although CV is used for tuning), and the second approach is cross-validation if you consider the hyperparameters (e.g., the feature importances, the number of features, K, etc.) to be parameters of a modeling process that you intend to evaluate with cross-validation. This is explained well in How to get hyper parameters in nested cross validation?.

If conceptualized this way, the answers in Hold-out validation vs. cross-validation become directly relevant. Some major benefits of CV:

  • If you use hold-out, you "lose" the test data (in contrast, CV allows you to make statements about the generalization error of the model trained on the full dataset, so you don't waste any data). Sample size is a major consideration here, and with 150 observations I think the recommendation would be to use CV.

  • CV with its multiple folds gives a sense of the variability of the feature selection/hyperparameter optimization process as well as some measure of variability of performance. Clearly a modeling process with 0.90 accuracy $\pm$ 0.20 is not the same as 0.90 $\pm$ 0.02.

Another method that gives similar benefits is the bootstrap; see Cross-validation or bootstrapping to evaluate classification performance?. That page also discusses why accuracy, even without class imbalance, is a poor scoring rule.
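
As an illustration of the bootstrap alternative (my own sketch, not taken from the linked page; the pipeline, grid, and 50 repetitions are arbitrary choices), each model is tuned on a bootstrap sample and scored on the out-of-bag observations that were never drawn into it:

```python
# Out-of-bag bootstrap evaluation of the same kind of tuning process
# (illustrative; the pipeline, grid, and repetition count are assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=1000, n_informative=20,
                           random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("select", SelectKBest(mutual_info_classif)),
                 ("clf", SVC())])
param_grid = {"select__k": [10, 20, 50, 100, 200],
              "clf__C": [0.1, 1, 10, 100]}

rng = np.random.default_rng(0)
n = len(y)
oob_scores = []
for b in range(50):                               # number of bootstrap draws
    boot = rng.integers(0, n, size=n)             # sample indices with replacement
    oob = np.setdiff1d(np.arange(n), boot)        # observations never drawn
    model = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy", n_jobs=-1)
    model.fit(X[boot], y[boot])                   # tune on the bootstrap sample
    oob_scores.append(model.score(X[oob], y[oob]))  # score on the left-out ones

print("Bootstrap OOB accuracy: %.3f +/- %.3f"
      % (np.mean(oob_scores), np.std(oob_scores)))
```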

One difficulty with CV for modeling-process evaluation (e.g., nested CV) is that it requires you to automate your entire modeling process, as you would expect. So anything that is subjective or manual is pretty much out of the question; sometimes domain expertise can only be integrated manually. Further, the hyperparameter search must be automated, which is fairly easy, but so must the search for the hyperparameter search space.

For example, if you find in some fold F that your chosen K (for kNN) is actually at the border of your search space, you might want to expand the search space. If you don't, your comparison between kNN and SVM will not be valid, because it is possible that you gave SVM a better search space than you gave kNN. This search-space expansion can only be done within fold F; there will be leakage if you have a globally defined search space used for all the folds and change it after seeing this (see Does changing the parameter search space after nested CV introduce optimistic bias?). All of this might take much longer to run (and be considerably more difficult to program) than a simple hold-out.
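
A hedged sketch of what that within-fold expansion might look like (the grids, the expansion rule, the cap, and the fixed number of selected features are all illustrative choices, and only the upper boundary of the kNN grid is handled):

```python
# Per-fold search-space expansion for kNN (illustrative construction, not a
# standard API): if the inner search selects the largest n_neighbors in the
# current grid, the grid is widened and the search rerun, using only that
# outer fold's training data, so nothing leaks across outer folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, n_features=1000, n_informative=20,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(mutual_info_classif, k=50)),  # K fixed here for brevity
    ("clf", KNeighborsClassifier()),
])

outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    grid = [1, 3, 5, 7, 9]                  # initial, possibly too narrow, grid
    while True:
        search = GridSearchCV(pipe, {"clf__n_neighbors": grid}, cv=10,
                              scoring="accuracy", n_jobs=-1)
        search.fit(X[train_idx], y[train_idx])
        best_k = search.best_params_["clf__n_neighbors"]
        # Stop when the winner is not on the upper boundary (the cap keeps the
        # sketch bounded; the lower boundary, k=1, cannot be expanded further).
        if best_k < max(grid) or max(grid) >= 31:
            break
        grid = grid + [max(grid) + 2, max(grid) + 4]   # widen and rerun
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print("Nested CV (kNN) accuracy: %.3f +/- %.3f"
      % (np.mean(outer_scores), np.std(outer_scores)))
```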
