Solved – Repeated k-fold cross-validation vs. repeated holdout cross-validation: which approach is more reasonable


I want to split my data 100 times (1/5 as testing, 4/5 as training), and then use the training data to build a model and the testing data to calculate the MSE.

There are two ways we can do this (both are sketched in code after the list):

  1. Do 5-fold cross validation 20 times, i.e., each time the samples are split into 5 folds, and each fold is used once as the testing set.

  2. Randomly choose 1/5 of the data as the testing set and the remaining 4/5 as the training set. Do this 100 times.
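
A minimal sketch of both schemes, assuming scikit-learn is available; `LinearRegression` and the synthetic data are only placeholders for whatever model and data you actually use:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, ShuffleSplit, cross_val_score

# Placeholder data and model
X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)
model = LinearRegression()

# Approach 1: 5-fold cross validation repeated 20 times -> 100 surrogate models
cv_repkfold = RepeatedKFold(n_splits=5, n_repeats=20, random_state=0)
mse_repkfold = -cross_val_score(model, X, y, cv=cv_repkfold,
                                scoring="neg_mean_squared_error")

# Approach 2: repeated hold-out, 100 independent random 80/20 splits
cv_holdout = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
mse_holdout = -cross_val_score(model, X, y, cv=cv_holdout,
                               scoring="neg_mean_squared_error")

print(mse_repkfold.mean(), mse_holdout.mean())
```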

Which one is more reasonable? Is there a theory of cross-validation that provides a reason to prefer one or the other?

Best Answer

Which method is more reasonable depends on exactly what conclusion you want to draw.


Actually, there is a third possibility, which differs from your version 2 in that the training data are chosen with replacement. It is closely related to out-of-bootstrap validation (it differs only in the number of training samples you draw).

Drawing with replacement is sometimes preferred over the cross-validation methods as it is closer to reality: drawing a sample in practice does not diminish the chance of drawing another sample with the same characteristics again, at least as long as only a very small fraction of the true population is sampled.
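
As an illustration, here is a minimal sketch of such an out-of-bootstrap validation, again with placeholder data and model: the training set is drawn with replacement (here $n$ cases, the usual out-of-bootstrap choice), and the cases that were not drawn serve as the test set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder data and model
X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)
rng = np.random.default_rng(0)
n = len(y)

oob_mse = []
for _ in range(100):
    train_idx = rng.integers(0, n, size=n)             # draw n cases with replacement
    test_idx = np.setdiff1d(np.arange(n), train_idx)   # roughly 1/e of the cases stay out
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    oob_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.mean(oob_mse))
```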

I'd prefer such an out-of-bootstrap validation if I want to draw conclusions about the performance that can be achieved when the given algorithm is trained on $n_{train}$ cases of the given problem. (Though the caveat of Bengio, Y. and Grandvalet, Y.: No Unbiased Estimator of the Variance of K-Fold Cross-Validation, Journal of Machine Learning Research, 2004, 5, 1089-1105 also applies here: you try to extrapolate from one given data set onto other training data sets as well, and within your data set there is no way to measure how representative that data set actually is.)


If, on the other hand, you want to estimate (approximately) how well the model you built on the whole data set performs on unknown data (otherwise with the same characteristics as your training data), then I'd prefer approach 1 (iterated/repeated cross validation).

  • Its surrogate models are a closer approximation to the model whose performance you actually want to know, so the smaller amount of randomness in the training data is deliberate here.
  • The surrogate models of iterated cross validation can be seen as perturbed (by exchanging a small fraction of the training cases) versions of each other. Thus, changes you see for the same test case can directly be attributed to model instability (see the sketch after this list).
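
A minimal sketch of how such an instability check could look, again assuming scikit-learn and placeholder data: each case is predicted once per repetition, so the spread of its predictions across repetitions reflects the instability of the surrogate models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Placeholder data and model
X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)
n_repeats = 20

preds = np.empty((len(y), n_repeats))   # one prediction per case and repetition
for rep in range(n_repeats):
    kf = KFold(n_splits=5, shuffle=True, random_state=rep)
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        preds[test_idx, rep] = model.predict(X[test_idx])

# Per-case spread across repetitions: a direct read-out of model instability
instability = preds.std(axis=1)
print(instability.mean())
```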

Note that whatever scheme you choose for your cross- or out-of-bootstrap validation, you only ever test at most $n$ distinct cases. The uncertainty caused by the finite number of test cases cannot decrease further, however many bootstrap runs, set-validation runs (your approach 2), or iterations of cross validation you perform.

The part of the variance that does decrease with more iterations/runs is the variance caused by model instability.
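
A toy simulation (not from any of the papers cited here; the numbers are made up) illustrating the two variance sources: each test case has a fixed "true" error, and each surrogate model adds instability noise on top. More repetitions average away the instability part, but the scatter of the overall estimate levels off at the finite-test-set floor of $sd_{case}/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases = 150          # size of the (finite) test pool
sd_case = 1.0          # spread of the per-case "true" errors
sd_instability = 1.0   # extra noise added by each surrogate model

def error_estimate(n_repeats):
    """One validation experiment: average error over n_cases and n_repeats."""
    case_error = rng.normal(0.0, sd_case, size=n_cases)
    instability = rng.normal(0.0, sd_instability, size=(n_cases, n_repeats))
    return (case_error[:, None] + instability).mean()

for n_repeats in (1, 10, 100):
    estimates = [error_estimate(n_repeats) for _ in range(2000)]
    # Scatter shrinks towards sd_case / sqrt(n_cases) ~ 0.082, but not below it
    print(n_repeats, round(np.std(estimates), 3))
```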


In practice, we've found only small differences in total error between 200 runs of out-of-bootstrap and 40 iterations of $5$-fold cross validation for our type of data: Beleites et al.: Variance reduction in estimating classification error using sparse datasets, Chemom Intell Lab Syst, 79, 91-100 (2005). Note that for our high-dimensional data, the resubstitution/autoprediction/training error easily becomes 0, so the .632-bootstrap is not an option and there is essentially no difference between out-of-bootstrap and .632+ out-of-bootstrap.

For a study that includes repeated hold-out (similar to your approach 2), see Kim: Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap, Computational Statistics & Data Analysis, 2009, 53, 3735-3745.