Solved – Cross-validation vs random sampling for classification test

bootstrap, cross-validation, python, resampling, scikit-learn

I have usually used cross-validation to test classification performance. However, I read an article claiming that random sampling (bootstrapping) works better in many cases, and I am not sure which one is better in my situation.

One of my data sets has about 300 features and 300 instances; the instances are split into 200 for training and 100 for testing. The class label is binary.

I want to find good features for classification. So I want to test accuracy of classifier. I ran Recursive Feature Elimination (RFE) of python sklearn, so I could get the list of 'feature importance ranking'.

In this case, comparing 10-fold cross-validation and random sampling, the procedure is (a code sketch follows the list):

  1. Use 10-fold cross-validation (or, alternatively, random sampling many times)
  2. Calculate the mean accuracy over the folds
  3. Remove the least important feature and repeat
  4. Use the set of features with the highest mean accuracy as the best one
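
For concreteness, here is a minimal sketch of this loop in scikit-learn. The synthetic data, the LogisticRegression base estimator, and all variable names are illustrative assumptions, not part of the original setup:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for the 300-feature, 300-instance binary data set
X, y = make_classification(n_samples=300, n_features=300, random_state=0)

estimator = LogisticRegression(max_iter=1000)
remaining = list(range(X.shape[1]))   # indices of still-active features
results = []                          # (n_features, mean CV accuracy)

while len(remaining) > 1:
    # Steps 1-2: 10-fold CV accuracy on the current feature subset
    scores = cross_val_score(estimator, X[:, remaining], y,
                             cv=10, scoring="accuracy")
    results.append((len(remaining), scores.mean()))
    # Step 3: let RFE drop the single least important feature
    rfe = RFE(estimator, n_features_to_select=len(remaining) - 1)
    rfe.fit(X[:, remaining], y)
    remaining = [f for f, keep in zip(remaining, rfe.support_) if keep]

# Step 4: the feature count with the highest mean accuracy
print(max(results, key=lambda r: r[1]))
```

scikit-learn's RFECV automates essentially this loop; either way, the selection itself still has to be validated on data it never touched, as the answer below stresses.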

Which method is likely to produce better classification results on the test set, from a statistical point of view?

Best Answer

  • If you use some kind of validation (it doesn't matter which) to optimize your model (e.g. by driving the feature reduction), and particularly if you compare many models and/or optimize iteratively, you absolutely need to validate the resulting final model. Whether you do that by a separate validation study, nested cross validation or nested out-of-bootstrap probably won't matter that much (see the sketch below).
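
    A minimal sketch of what that nested validation can look like in scikit-learn, assuming an RFECV feature eliminator and a LogisticRegression base estimator (both illustrative choices): the inner CV drives the feature reduction, the outer CV validates the resulting final model.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=300, random_state=0)

# Inner CV drives the feature elimination ...
selector = RFECV(LogisticRegression(max_iter=1000), step=10, cv=10)
# ... while the outer CV validates the whole selection + fitting procedure
outer_scores = cross_val_score(selector, X, y, cv=10)
print(outer_scores.mean(), outer_scores.std())
```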

  • The main difference between the resampling used for cross validation and that used for out-of-bootstrap is that bootstrapping resamples with replacement, while cross validation resamples without replacement. In addition, cross validation ensures that within each "run" every sample is tested exactly once (illustrated in the sketch below).
    I sometimes have questions that are more directly answered by repeated/iterated cross validation (stability of predictions), but:
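
    The sampling difference is easy to see in code; this sketch (sizes and seeds are arbitrary) contrasts one bootstrap draw with a 10-fold split:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 300
idx = np.arange(n)

# Out-of-bootstrap: draw n cases WITH replacement; the cases never
# drawn (roughly 1/e ~ 36.8% of them) form the test set
boot = rng.choice(idx, size=n, replace=True)
oob = np.setdiff1d(idx, boot)
print(len(np.unique(boot)), "unique in-bag,", len(oob), "out-of-bag")

# 10-fold CV: partition WITHOUT replacement; every case is tested
# exactly once per run
test_counts = np.zeros(n, dtype=int)
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(idx):
    test_counts[test] += 1
assert (test_counts == 1).all()
```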

  • We found repeated/iterated k-fold cross validation and out-of-bootstrap resampling to have about the same total error, based on equal numbers of surrogate models. I mostly work with vibrational spectra, and 300 features would be quite typical for my data as well; but my features are highly correlated and I usually have far fewer independent cases (though possibly repeated measurements).
    Here's the paper: Beleites, C.; Baumgartner, R.; Bowman, C.; Somorjai, R.; Steiner, G.; Salzer, R. & Sowa, M. G.: Variance reduction in estimating classification error using sparse datasets, Chemom Intell Lab Syst, 79, 91–100 (2005).

    Kim, J.-H.: Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap, Computational Statistics & Data Analysis, 53, 3735–3745 (2009), DOI: 10.1016/j.csda.2009.04.009, reports similar findings.

  • I have not yet read the Vanwinckelen paper @Lennart linked above thoroughly, but at first glance it looks very promising. Note that while it points out that people may rely too much on cross validation, it does not compare cross validation against bootstrap-based techniques.

  • I also think there is often a deep misunderstanding about what the repetitions/iterations of k-fold cross validation can and cannot do. Importantly, they cannot reduce the variance that is due to the limited number of independent (different) cases tested. What they can do is let you measure and reduce the variance due to model instability (see the sketch below). My understanding of the bootstrap-based resampling schemes is that they are similar in that respect.
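
    A hedged sketch of that measurement, assuming repeated 10-fold CV with scikit-learn's RepeatedKFold and an illustrative LogisticRegression: each case is predicted once per repetition, so label flips between repetitions expose model instability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

X, y = make_classification(n_samples=300, n_features=300, random_state=0)
n_splits, n_repeats = 10, 5
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)

# One predicted label per case per repetition (each case is tested
# exactly once within each repetition)
preds = np.empty((n_repeats, len(y)))
for i, (train, test) in enumerate(rkf.split(X)):
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    preds[i // n_splits, test] = model.predict(X[test])

# Cases whose predicted label changes between repetitions reveal model
# instability; more repetitions cannot substitute for more independent cases.
flipped = (preds.min(axis=0) != preds.max(axis=0)).mean()
print(f"{flipped:.1%} of cases get different labels across repetitions")
```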

  • You may want to look into the questions here about how to choose a cross-validation method, as they typically say something about the bootstrap as well. Here's a starting point: How to evaluate/select cross validation method?

  • Finally, a totally different thought: at the very least, your data-driven optimization (the feature selection) should use a proper scoring rule, not accuracy. Accuracy is not "well behaved" from a statistical point of view: it has an unnecessarily high variance, and on top of that it does not necessarily select the best model. A sketch with proper scoring rules follows below.
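
    As a sketch, scikit-learn exposes proper scoring rules as the neg_brier_score and neg_log_loss scorers (negated because its convention is that larger scores are better); the estimator and data here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=300, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Proper scoring rules act on predicted probabilities, not hard labels
brier = cross_val_score(clf, X, y, cv=10, scoring="neg_brier_score")
logloss = cross_val_score(clf, X, y, cv=10, scoring="neg_log_loss")
print("Brier score:", -brier.mean(), " log loss:", -logloss.mean())
```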
