K-Fold Cross Validation – Which K-Fold Cross Validation Strategy is Better?

cross-validation

I have come across two k-fold cross-validation strategies. Both methods divide the data set into k folds, say k = 10. The first one chooses the validation folds randomly, while the second one chooses them sequentially. Let me explain this with an example:

Method 1 chooses 3 folds at random to use as the validation set, and the remaining 7 folds are used as the training set. It repeats this operation a given number of rounds; the round count is a parameter whose value is smaller than C(10, 3).

Method 1: RoundCount x ( (3 x random folds) + (7 x remaining folds) )

Method 2 fixes the round count at 10, so it loops ten times. In each step, it takes a different fold as the validation set, and the rest are used as the training set.

Method 2: 10 x ( (1 x incremental fold) + ( 9 x remaining folds) )

The second method guarantees that every single sample appears in both the validation and the training set.
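
For concreteness, here is a minimal sketch of the two methods in Python (NumPy assumed; the variable names are illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, k, round_count = 100, 10, 5
    indices = rng.permutation(n_samples)    # shuffle the sample indices once
    folds = np.array_split(indices, k)      # k folds of (nearly) equal size

    # Method 1: each round, pick 3 folds at random as the validation set
    for _ in range(round_count):
        val_folds = rng.choice(k, size=3, replace=False)
        val_idx = np.concatenate([folds[i] for i in val_folds])
        train_idx = np.concatenate([folds[i] for i in range(k) if i not in val_folds])
        # ... fit on train_idx, evaluate on val_idx ...

    # Method 2: each fold is the validation set exactly once (standard k-fold CV)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # ... fit on train_idx, evaluate on val_idx ...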

Does one of these methods have an obvious advantage over the other?

Best Answer

I don't quite understand your methods, but here's what I know as cross-validation sub-schemes; maybe that helps you clarify the question:

Assume you have 9 samples that are ordered 1 to 9, and you're doing 3-fold CV.

  • block wise: the data is divided into 3 consecutive blocks:

    case    1    2    3    4    5    6    7    8    9
    fold    1    1    1    2    2    2    3    3    3
    

    I see hardly any application where this would be useful, except to extract hints about extrapolation behaviour: the first and the last block then tell you how the model does at extrapolating just outside the domain covered by the training data (the calibration range in chemometrics).

  • interleaved, stripes, or venetian blinds: the 1st case is assigned to fold 1, the 2nd to fold 2, and so on:

    case    1    2    3    4    5    6    7    8    9
    fold    1    2    3    1    2    3    1    2    3
    

    This is sometimes used for (chemical) calibration. Samples are sorted, e.g. by increasing concentration of the analyte. This assignment scheme guarantees that the training and test cases for each surrogate model always span the concentration range as widely (and as evenly spaced) as possible.

  • random: you assign the cases to folds in a random fashion:

    case    1    2    3    4    5    6    7    8    9
    fold    3    3    1    1    2    1    2    3    2
    

    You can do that by shuffling your cases and then using one of the above schemes (see the sketch after this list).
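
A minimal sketch of the three assignment schemes in Python (NumPy assumed):

    import numpy as np

    n, k = 9, 3  # 9 cases, 3-fold CV

    # block wise: consecutive blocks -> 1 1 1 2 2 2 3 3 3
    block = np.repeat(np.arange(1, k + 1), n // k)

    # interleaved (venetian blinds): cycle through the folds -> 1 2 3 1 2 3 1 2 3
    interleaved = np.tile(np.arange(1, k + 1), n // k)

    # random: shuffle the cases first, then either scheme above works
    rng = np.random.default_rng(42)
    random_assignment = rng.permutation(interleaved)

    print(block)              # [1 1 1 2 2 2 3 3 3]
    print(interleaved)        # [1 2 3 1 2 3 1 2 3]
    print(random_assignment)  # a random permutation of the fold labels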

IMHO the random scheme offers a crucial advantage: you can repeat the procedure. This is known as iterated or repeated $k$-fold cross validation. The iterations help you reduce the variance that is due to instability of the surrogate models (and measure this instability), which is not possible with the first two schemes. So iterated $k$-fold CV is the best choice, and it implies random assignment, unless you have specific reasons for using one of the non-random schemes.
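
A minimal sketch of iterated k-fold CV, assuming scikit-learn (RepeatedKFold reshuffles the fold assignment in each repetition; the iris data and model are just placeholders):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RepeatedKFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # 10-fold CV repeated 20 times, each with a fresh random fold assignment
    cv = RepeatedKFold(n_splits=10, n_repeats=20, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv)

    # One mean score per repetition; their spread is a handle on the
    # (in)stability of the surrogate models
    per_rep = scores.reshape(20, 10).mean(axis=1)
    print(per_rep.mean(), per_rep.std())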

Note that if $k = n$ (leave-one-out), all 3 schemes are the same.

Cross validation always guarantees that each sample is tested exactly once per iteration and used exactly $k - 1$ times for training. If your splitting scheme doesn't have this property, it is not cross validation. There are other splitting/resampling schemes for validation, such as hold-out/set validation (as opposed to 2-fold CV), out-of-bootstrap validation, etc.
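
A quick sanity check of this property, assuming scikit-learn's KFold:

    import numpy as np
    from sklearn.model_selection import KFold

    n, k = 20, 5
    test_counts = np.zeros(n, dtype=int)
    train_counts = np.zeros(n, dtype=int)

    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(np.arange(n)):
        test_counts[test_idx] += 1
        train_counts[train_idx] += 1

    assert (test_counts == 1).all()       # each sample is tested exactly once
    assert (train_counts == k - 1).all()  # and used k - 1 times for training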