Solved – Multiple cross-validation and multiple train-test splits

cross-validation, training error, validation

Suppose we have only four observations in a dataset. Let's call them a, b, c and d.

If we perform k-fold cross-validation with k=2, we might get the following:

We get two groups of data, (a,b) and (c,d). We first train on (a,b), then validate our machine learning model on (c,d). Then we train on (c,d) and validate on (a,b).
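This procedure can be sketched with scikit-learn's `KFold` splitter (assuming scikit-learn is installed; the array `X` is just a placeholder for the four observations):

```python
# Minimal sketch of 2-fold CV on four observations with scikit-learn's KFold.
# Without shuffling, the folds are exactly (a,b) and (c,d).
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # placeholder features for a, b, c, d
labels = np.array(["a", "b", "c", "d"])

kf = KFold(n_splits=2, shuffle=False)
for train_idx, test_idx in kf.split(X):
    print("train:", labels[train_idx], "validate:", labels[test_idx])
```

Each observation appears in the validation set exactly once, which is the defining property of a single k-fold pass.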

Is this really a complete 2-fold procedure?

Because, if we shuffle the data, we could get two other groups, let's say (a,c) and (b,d).

So this time, we would need to do another 2-fold cross-validation.

So my first question is: do we really need to perform multiple k-fold cross-validations, shuffling the data each time, in order to get a good estimate of the performance of our model?
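Repeating k-fold with a fresh shuffle each time is exactly what scikit-learn's `RepeatedKFold` does, so the idea can be sketched like this (the tiny four-row `X` is again just a placeholder):

```python
# Sketch: 2-fold CV repeated 3 times, each repetition with a new shuffle.
# This yields 3 * 2 = 6 train/validate splits in total.
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(4).reshape(-1, 1)          # placeholder for observations a, b, c, d
labels = np.array(["a", "b", "c", "d"])

rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=0)
for i, (train_idx, test_idx) in enumerate(rkf.split(X)):
    print(f"split {i}: train={labels[train_idx]} validate={labels[test_idx]}")
```

Within each repetition the two validation folds still partition the data; across repetitions the groupings differ because of the reshuffle.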

With k = number of observations, i.e. a leave-one-out procedure, there is of course no need to do so, since shuffling the data will give us the same groups.
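That shuffle-invariance is easy to check with a small sketch using scikit-learn's `LeaveOneOut`: whatever the row order, the collection of held-out sets is always the same n singletons.

```python
# Sketch: leave-one-out CV holds out each observation exactly once,
# so the set of splits does not depend on how the data are shuffled.
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(4).reshape(-1, 1)          # placeholder for four observations
loo = LeaveOneOut()
test_sets = [set(test_idx) for _, test_idx in loo.split(X)]
print(test_sets)
```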

Finally, if the answer to the above question is yes, what is the advantage of (multiple) cross-validation compared to just multiple train-test splits?

For example, we could shuffle the dataset, take the first 80% of the observations for training and the last 20% for testing, and repeat this multiple times.

Am I totally wrong, or is a single k-fold not enough to assess the performance of a model? And if so, what is the difference between it and doing multiple train-test splits?

Thanks guys

Best Answer

Your original design is in fact a 2-fold cross-validation.

As to why you do or do not need to repeat k-fold cross-validation, I believe your question is a duplicate of this one: How many times should we repeat a K-fold CV?