Solved – the benefit of only performing a random split once at the beginning during K-fold cross-validation

cross-validation, scikit-learn

I had a discussion with my colleague about KFold and StratifiedKFold validation in scikit-learn.

My understanding of these two ways of validation is that at the beginning of the validation process, the data set is randomly divided into k folds. In each iteration, k-1 of those folds make up the training dataset and the remaining fold becomes the validation dataset. Validation is performed k times, and with each repetition a different fold serves as the validation set.

My colleague argued that it would be better to draw a fresh random train/validation split at every iteration. Edit: scikit-learn also has splitters for this sort of validation, named "ShuffleSplit" and "StratifiedShuffleSplit".
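For concreteness, here is a minimal sketch of the two kinds of splitters in scikit-learn (the toy data and parameter values are my own, chosen only for illustration): KFold fixes the folds once and rotates through them, while ShuffleSplit draws a fresh random train/test split on every iteration.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(10).reshape(-1, 1)  # 10 toy samples

# K-fold style: one shuffle up front, then the k folds rotate through the data.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Shuffle-split style: an independent random train/test draw on every iteration.
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

print("KFold test sets (each index appears in exactly one test set):")
for _, test_idx in kf.split(X):
    print(sorted(test_idx.tolist()))

print("ShuffleSplit test sets (an index may repeat across iterations or never appear):")
for _, test_idx in ss.split(X):
    print(sorted(test_idx.tolist()))
```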

I had no mathematical intuition for why one approach would be superior to the other.

The benefit of only doing the initial split is that the validation datasets are non-redundant: every observation is validated exactly once. But is that really a benefit?

So my question is this:

Is he right? What is the benefit (intuitive or mathematical) of the actual KFold or StratifiedKFold implementation?

Edit: Can someone give me an intuition for when to pick which?

Best Answer

The benefit of determining the contents of each fold at the beginning, rather than re-sampling, is that you avoid the bias that can arise from selecting a single observation for more than one training or testing set. This is only a benefit if you care that no records are over-represented in your validation.

If the argument is instead that new folds should be generated without replacement after each fold has been tested, it can be shown that the probability of any single observation appearing in a given fold is unaffected, and the two approaches are therefore equivalent.

Stratified methods are most commonly used in cases where classes are unbalanced, to avoid, for example, folds with no (or very few) positive examples in a sparse binary classification problem. This applies to both the StratifiedShuffleSplit and StratifiedKFold methods.
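As a rough illustration (the imbalanced labels below are made up, with 5% positives), StratifiedKFold keeps the number of positives roughly constant across test folds, while plain KFold can leave some folds with few or none:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.RandomState(0)
X = rng.randn(100, 3)              # 100 toy samples, 3 features
y = np.array([1] * 5 + [0] * 95)   # sparse binary labels: 5% positive class
rng.shuffle(y)

splitters = [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
             ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]
for name, splitter in splitters:
    positives = [int(y[test].sum()) for _, test in splitter.split(X, y)]
    print(f"{name}: positives per test fold = {positives}")
```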

In order to determine which to use between K-Fold and ShuffleSplit, though, we'll have to understand some key differences between the methods.

  1. In K-fold, the model is trained at each iteration on a proportion of the data set equal to $\frac{k-1}{k}$, i.e. for $k=5$ the model is trained on 80% of the data at each iteration and for $k=10$ the model is trained on 90% of the data. There are $k$ training iterations in the algorithm.

    In ShuffleSplit, the model is trained at each iteration on a defined train_size. The default size of the training set for the scikit-learn implementation is 90%. The number of iterations is parameterized (n_splits).

    You can therefore configure two validation strategies that have the same number of training runs and are trained on the same proportion of the data, e.g. K-fold validation with K=10 and ShuffleSplit with train_size = .9 and n_splits = 10 (see the sketch below).

No difference here.
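Here is a small sketch of that equivalence, using scikit-learn's parameter names on arbitrary toy data:

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(100).reshape(-1, 1)  # 100 toy samples

kf = KFold(n_splits=10, shuffle=True, random_state=0)
ss = ShuffleSplit(n_splits=10, train_size=0.9, random_state=0)

for name, splitter in [("KFold, k=10", kf),
                       ("ShuffleSplit, train_size=.9, n_splits=10", ss)]:
    train_sizes = [len(train) for train, _ in splitter.split(X)]
    print(f"{name}: {len(train_sizes)} iterations, training-set sizes = {set(train_sizes)}")
# Both report 10 iterations with training sets of 90 samples (90% of the data).
```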

  2. K-Fold is guaranteed to both train and test on every observation of the dataset an equal number of times ($k-1$ and 1 times, respectively).

    ShuffleSplit does not guarantee this: because it re-samples at each iteration, the same observation can be selected for the test set twice, or several times, across iterations, while other observations may never be represented in any test set (see the sketch below).

This is a point for using K-Fold.
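A quick sketch of that guarantee (same toy setup as above, with both splitters given 10 iterations at a 10% test size): K-Fold tests every sample exactly once, while ShuffleSplit typically tests some samples repeatedly and skips others.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(100).reshape(-1, 1)

for name, splitter in [("KFold", KFold(n_splits=10, shuffle=True, random_state=0)),
                       ("ShuffleSplit", ShuffleSplit(n_splits=10, train_size=0.9, random_state=0))]:
    counts = np.zeros(len(X), dtype=int)   # how often each sample lands in a test set
    for _, test_idx in splitter.split(X):
        counts[test_idx] += 1
    print(f"{name}: never tested = {np.sum(counts == 0)}, "
          f"tested once = {np.sum(counts == 1)}, tested more than once = {np.sum(counts > 1)}")
```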

Some other discussion has brought up training time as a point in favor of ShuffleSplit; that is, you can configure ShuffleSplit to run with less demanding coverage (e.g. train_size = .8, n_splits = 5). This is only a strong argument for configurations that can't be approximated by adjusting the parameter $k$, since the ShuffleSplit parameters just mentioned are matched by simply setting $k=5$.

For example, K-fold can't approximate train_size = .8, n_splits = 3. The tradeoff is that every ShuffleSplit configuration that is more computationally efficient than K-Fold also provides less complete validation coverage: in this example (train_size = .8, n_splits = 3), the model can never be tested on more than 60% of the data, and will almost always be tested on less than that.
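A quick check of that coverage limit (again on arbitrary toy data, using scikit-learn's parameter names): with three test sets of 20% each, the union of tested samples can never exceed 60% of the data, and repeats usually push it lower.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(1000).reshape(-1, 1)
ss = ShuffleSplit(n_splits=3, train_size=0.8, random_state=0)

tested = set()                      # union of all test indices seen so far
for _, test_idx in ss.split(X):
    tested.update(test_idx.tolist())

print(f"Fraction of samples ever tested: {len(tested) / len(X):.1%}")  # at most 60%
```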

TL;DR:
K-Fold is generally better, except in edge cases where K-Fold's computational cost isn't justified by a need for comprehensive cross-validation coverage.