Solved – Repeated k-fold cross-validation vs. repeated holdout cross-validation: which approach is more reasonable


I want to split my data 100 times (1/5 as testing, 4/5 as training), and then use the training data to build a model and the testing data to calculate the MSE.

There are two ways we can do this (both are sketched in code after the list):

  1. Do 5-fold cross validation 20 times, i.e., each time the samples are split into 5 folds, and each fold is used once as the testing set.

  2. Randomly choose 1/5 of the data as the testing set and the remaining 4/5 as the training set. Do this 100 times.
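
A minimal sketch of both schemes, assuming scikit-learn is available; `LinearRegression` and the synthetic data are only placeholders for whatever model and data you actually use:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, ShuffleSplit, cross_val_score

# Placeholder data and model
X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)
model = LinearRegression()

# Approach 1: 5-fold cross validation repeated 20 times -> 100 surrogate models
cv_repkfold = RepeatedKFold(n_splits=5, n_repeats=20, random_state=0)
mse_repkfold = -cross_val_score(model, X, y, cv=cv_repkfold,
                                scoring="neg_mean_squared_error")

# Approach 2: repeated hold-out, 100 independent random 80/20 splits
cv_holdout = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
mse_holdout = -cross_val_score(model, X, y, cv=cv_holdout,
                               scoring="neg_mean_squared_error")

print(mse_repkfold.mean(), mse_holdout.mean())
```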

Which one is more reasonable? Is there a theory of cross-validation that provides a reason to prefer one or the other?

Best Answer

Which method is more reasonable depends on exactly what conclusion you want to draw.


Actually, there is a third possibility, which differs from your version 2 in that the training data are chosen with replacement. It is closely related to out-of-bootstrap validation (it differs only in the number of training samples you draw).

Drawing with replacement is sometimes preferred over the cross-validation methods as it is closer to reality: drawing a sample in practice does not diminish the chance of drawing another sample with the same characteristics again, at least as long as only a very small fraction of the true population is sampled.
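
As an illustration, here is a minimal sketch of such an out-of-bootstrap validation, again with placeholder data and model: the training set is drawn with replacement (here $n$ cases, the usual out-of-bootstrap choice), and the cases that were not drawn serve as the test set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder data and model
X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)
rng = np.random.default_rng(0)
n = len(y)

oob_mse = []
for _ in range(100):
    train_idx = rng.integers(0, n, size=n)             # draw n cases with replacement
    test_idx = np.setdiff1d(np.arange(n), train_idx)   # roughly 1/e of the cases stay out
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    oob_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.mean(oob_mse))
```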

I'd prefer such an out-of-bootstrap validation if I want to draw conclusions about the performance that can be achieved when the given algorithm is trained on $n_{train}$ cases of the given problem. (Though the caveat of Bengio, Y. and Grandvalet, Y.: No Unbiased Estimator of the Variance of K-Fold Cross-Validation, Journal of Machine Learning Research, 2004, 5, 1089-1105 also applies here: you try to extrapolate from one given data set onto other training data sets as well, and within your data set there is no way to measure how representative that data set actually is.)


If, on the other hand, you want to estimate (approximately) how well the model you built on the whole data set performs on unknown data (otherwise with the same characteristics as your training data), then I'd prefer approach 1 (iterated/repeated cross validation).

  • Its surrogate models are a closer approximation to the model whose performance you actually want to know, so the smaller amount of randomness in the training data is deliberate here.
  • The surrogate models of iterated cross validation can be seen as perturbed (by exchanging a small fraction of the training cases) versions of each other. Thus, changes you see for the same test case can directly be attributed to model instability (see the sketch after this list).
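
A minimal sketch of how such an instability check could look, again assuming scikit-learn and placeholder data: each case is predicted once per repetition, so the spread of its predictions across repetitions reflects the instability of the surrogate models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Placeholder data and model
X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)
n_repeats = 20

preds = np.empty((len(y), n_repeats))   # one prediction per case and repetition
for rep in range(n_repeats):
    kf = KFold(n_splits=5, shuffle=True, random_state=rep)
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        preds[test_idx, rep] = model.predict(X[test_idx])

# Per-case spread across repetitions: a direct read-out of model instability
instability = preds.std(axis=1)
print(instability.mean())
```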

Note that whatever scheme you choose for your cross- or out-of-bootstrap validation, you only ever test at most $n$ distinct cases. The uncertainty caused by the finite number of test cases cannot decrease further, however many bootstrap runs, set-validation runs (your approach 2), or iterations of cross validation you perform.

The part of the variance that does decrease with more iterations/runs is the variance caused by model instability.
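
A toy simulation (not from any of the papers cited here; the numbers are made up) illustrating the two variance sources: each test case has a fixed "true" error, and each surrogate model adds instability noise on top. More repetitions average away the instability part, but the scatter of the overall estimate levels off at the finite-test-set floor of $sd_{case}/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases = 150          # size of the (finite) test pool
sd_case = 1.0          # spread of the per-case "true" errors
sd_instability = 1.0   # extra noise added by each surrogate model

def error_estimate(n_repeats):
    """One validation experiment: average error over n_cases and n_repeats."""
    case_error = rng.normal(0.0, sd_case, size=n_cases)
    instability = rng.normal(0.0, sd_instability, size=(n_cases, n_repeats))
    return (case_error[:, None] + instability).mean()

for n_repeats in (1, 10, 100):
    estimates = [error_estimate(n_repeats) for _ in range(2000)]
    # Scatter shrinks towards sd_case / sqrt(n_cases) ~ 0.082, but not below it
    print(n_repeats, round(np.std(estimates), 3))
```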


In practice, we've found only small differences in total error between 200 runs of out-of-bootstrap and 40 iterations of $5$-fold cross validation for our type of data: Beleites et al.: Variance reduction in estimating classification error using sparse datasets, Chemom Intell Lab Syst, 79, 91-100 (2005). Note that for our high-dimensional data, the resubstitution/autoprediction/training error easily becomes 0, so the .632-bootstrap is not an option and there is essentially no difference between out-of-bootstrap and .632+ out-of-bootstrap.

For a study that includes repeated hold-out (similar to your approach 2), see Kim: Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap, Computational Statistics & Data Analysis, 2009, 53, 3735-3745.