Solved – Cross-validation vs random sampling for classification test

bootstrap, cross-validation, python, resampling, scikit-learn

I have usually used cross-validation to test classification performance. However, I read an article claiming that random sampling (bootstrapping) works better in many cases, and I am not sure which one is better in my situation.

One of my data sets has about 300 features and 300 instances; the instances are split into 200 for training and 100 for testing. The class label is binary.

I want to find good features for classification. So I want to test accuracy of classifier. I ran Recursive Feature Elimination (RFE) of python sklearn, so I could get the list of 'feature importance ranking'.

In this case, comparing 10-fold cross-validation and random sampling, the procedure is (a code sketch follows the list):

  1. Use 10-fold cross-validation (or, alternatively, random sampling many times)
  2. Calculate the mean accuracy over the folds
  3. Remove the least important feature and repeat
  4. Use the set of features with the highest mean accuracy as the best one
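
For concreteness, here is a minimal sketch of this loop in scikit-learn. The synthetic data, the LogisticRegression base estimator, and all variable names are illustrative assumptions, not part of the original setup:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for the 300-feature, 300-instance binary data set
X, y = make_classification(n_samples=300, n_features=300, random_state=0)

estimator = LogisticRegression(max_iter=1000)
remaining = list(range(X.shape[1]))   # indices of still-active features
results = []                          # (n_features, mean CV accuracy)

while len(remaining) > 1:
    # Steps 1-2: 10-fold CV accuracy on the current feature subset
    scores = cross_val_score(estimator, X[:, remaining], y,
                             cv=10, scoring="accuracy")
    results.append((len(remaining), scores.mean()))
    # Step 3: let RFE drop the single least important feature
    rfe = RFE(estimator, n_features_to_select=len(remaining) - 1)
    rfe.fit(X[:, remaining], y)
    remaining = [f for f, keep in zip(remaining, rfe.support_) if keep]

# Step 4: the feature count with the highest mean accuracy
print(max(results, key=lambda r: r[1]))
```

scikit-learn's RFECV automates essentially this loop; either way, the selection itself still has to be validated on data it never touched, as the answer below stresses.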

Which method is likely to produce better classification results on the test set, from a statistical point of view?

Best Answer

  • If you use some kind of validation (it doesn't matter which) to optimize your model (e.g. by driving the feature reduction), and particularly if you compare many models and/or optimize iteratively, you absolutely need to validate the resulting final model. Whether you do that by a separate validation study, nested cross validation or nested out-of-bootstrap probably won't matter that much (see the sketch below).
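
    A minimal sketch of what that nested validation can look like in scikit-learn, assuming an RFECV feature eliminator and a LogisticRegression base estimator (both illustrative choices): the inner CV drives the feature reduction, the outer CV validates the resulting final model.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=300, random_state=0)

# Inner CV drives the feature elimination ...
selector = RFECV(LogisticRegression(max_iter=1000), step=10, cv=10)
# ... while the outer CV validates the whole selection + fitting procedure
outer_scores = cross_val_score(selector, X, y, cv=10)
print(outer_scores.mean(), outer_scores.std())
```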

  • The main difference between the resampling used for cross validation and that used for out-of-bootstrap is that bootstrapping resamples with replacement, while cross validation resamples without replacement. In addition, cross validation ensures that within each "run" every sample is tested exactly once (illustrated in the sketch below).
    I sometimes have questions that are more directly answered by repeated/iterated cross validation (stability of predictions), but:
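
    The sampling difference is easy to see in code; this sketch (sizes and seeds are arbitrary) contrasts one bootstrap draw with a 10-fold split:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 300
idx = np.arange(n)

# Out-of-bootstrap: draw n cases WITH replacement; the cases never
# drawn (roughly 1/e ~ 36.8% of them) form the test set
boot = rng.choice(idx, size=n, replace=True)
oob = np.setdiff1d(idx, boot)
print(len(np.unique(boot)), "unique in-bag,", len(oob), "out-of-bag")

# 10-fold CV: partition WITHOUT replacement; every case is tested
# exactly once per run
test_counts = np.zeros(n, dtype=int)
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(idx):
    test_counts[test] += 1
assert (test_counts == 1).all()
```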

  • We found repeated/iterated k-fold cross validation and out-of-bootstrap resampling to have about the same total error, based on equal numbers of surrogate models. I mostly work with vibrational spectra, and 300 features would be quite typical for my data as well; but my features are highly correlated and I usually have far fewer independent cases (though possibly repeated measurements).
    Here's the paper: Beleites, C.; Baumgartner, R.; Bowman, C.; Somorjai, R.; Steiner, G.; Salzer, R. & Sowa, M. G.: Variance reduction in estimating classification error using sparse datasets, Chemom Intell Lab Syst, 79, 91–100 (2005).

    Kim, J.-H.: Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap, Computational Statistics & Data Analysis, 53, 3735–3745 (2009), DOI: 10.1016/j.csda.2009.04.009, reports similar findings.

  • I have not yet read the Vanwinckelen paper @Lennart linked above thoroughly, but at first glance it looks very promising. Note that while it points out that people may rely too much on cross validation, it does not compare cross validation against bootstrap-based techniques.

  • I also think there is often a deep misunderstanding about what the repetitions/iterations of k-fold cross validation can and cannot do. Importantly, they cannot reduce the variance that is due to the limited number of independent (different) cases tested. What they can do is let you measure and reduce the variance due to model instability (see the sketch below). My understanding of the bootstrap-based resampling schemes is that they are similar in that respect.
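
    A hedged sketch of that measurement, assuming repeated 10-fold CV with scikit-learn's RepeatedKFold and an illustrative LogisticRegression: each case is predicted once per repetition, so label flips between repetitions expose model instability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

X, y = make_classification(n_samples=300, n_features=300, random_state=0)
n_splits, n_repeats = 10, 5
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)

# One predicted label per case per repetition (each case is tested
# exactly once within each repetition)
preds = np.empty((n_repeats, len(y)))
for i, (train, test) in enumerate(rkf.split(X)):
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    preds[i // n_splits, test] = model.predict(X[test])

# Cases whose predicted label changes between repetitions reveal model
# instability; more repetitions cannot substitute for more independent cases.
flipped = (preds.min(axis=0) != preds.max(axis=0)).mean()
print(f"{flipped:.1%} of cases get different labels across repetitions")
```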

  • You may want to look into the questions here about how to choose a cross-validation method, as they typically say something about the bootstrap as well. Here's a starting point: How to evaluate/select cross validation method?

  • Finally, a totally different thought: at the very least, your data-driven optimization (the feature selection) should use a proper scoring rule, not accuracy. Accuracy is not "well behaved" from a statistical point of view: it has an unnecessarily high variance, and on top of that it does not necessarily select the best model. A sketch with proper scoring rules follows below.
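
    As a sketch, scikit-learn exposes proper scoring rules as the neg_brier_score and neg_log_loss scorers (negated because its convention is that larger scores are better); the estimator and data here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=300, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Proper scoring rules act on predicted probabilities, not hard labels
brier = cross_val_score(clf, X, y, cv=10, scoring="neg_brier_score")
logloss = cross_val_score(clf, X, y, cv=10, scoring="neg_log_loss")
print("Brier score:", -brier.mean(), " log loss:", -logloss.mean())
```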
