Machine Learning – How Repeated k-Fold Cross Validation Identifies Model Instability

bias-variance tradeoff, cross-validation, machine learning, model selection

In these threads 1,2,3, cbeleites mentions that in a single k-fold cross validation you cannot tell whether the variance is caused by model instability or by the use of a different test set. Hence, one can perform repeated k-fold cross validation to get a measure of model instability.

My question is:

  1. Model stability can be thought of as getting the same predictions for a fixed set of test data as you train your model on slightly different data. How does repeated k-fold cross validation provide a measure of this when your training and test data are still random as you repeat it? You are training on different data, but your test data is also different, so the relationship between the two is not clear to me. Does anyone have a simple and intuitive explanation for this?

Best Answer

Here's the trick:

Each case is tested in exactly one fold in each run (iteration, repetition) of the cross validation. After b runs we have trained a total of bk surrogate models, of which b were used to test/predict any one given case.

Each of those b predictions comes from a different surrogate model, but since we're looking at only one case (at a time), any difference in the b predictions must be caused by differences in the surrogate models. These differences are in turn the models' reaction to their training data being slightly different: in addition to the case in question, a few more cases were excluded from training, and which ones exactly differs between the b surrogate models.

We can thus say that we measure the variation in prediction caused by exchanging a few training cases against other training cases (this holds for deterministic training algorithms; if the training algorithm has a random part, that variance comes on top - but we can also measure that separately by repeatedly training with exactly the same training data).

  • For each case, we thus look at the variation across b out of the bk surrogate models trained in total.
  • Another case will also have b predictions by b different surrogate models, but those surrogate models will usually not be the same as the b surrogate models that predicted the first case (the code sketch below makes this bookkeeping concrete).
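
Here is a minimal sketch of that bookkeeping in Python using scikit-learn's RepeatedKFold; the iris data and the decision tree classifier are placeholder choices for illustration, not part of the original answer. It collects, for every case, the b predictions made by the b surrogate models that left that case out.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
k, b = 5, 10                      # k folds, b repetitions
rkf = RepeatedKFold(n_splits=k, n_repeats=b, random_state=0)

n = len(y)
preds = np.full((b, n), -1)       # preds[run, case] = predicted label for that case in that run

for split_idx, (train_idx, test_idx) in enumerate(rkf.split(X)):
    run = split_idx // k          # RepeatedKFold yields the k folds of run 0, then run 1, ...
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    preds[run, test_idx] = model.predict(X[test_idx])

# Every case now has exactly b predictions (one per run), each made by a
# different surrogate model; variation within a column reflects instability.
assert (preds >= 0).all()
```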

Here's an illustration:

[figure: repeated k-fold cross validation]

On the left are the data, the triangles symbolize the surrogate models and on the right we have predictions and whether they are correct.

E.g. look at case 2 (of class A). It is a test case in fold 3 of iteration 1, fold 1 of iteration 2, and fold 1 of iteration 3.

  • In iteration 1, fold 3, it is left out together with cases 1 and 9,
  • in iteration 2, fold 1, it is left out together with cases 3 and 6, and
  • in iteration 3, fold 1, it is left out together with cases 1 and 6.

So all the differences between these surrogate models stem from exchanging 1 or 2 (in general: up to n/k - 1) training cases.
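
As a quick, hedged check of this claim (the random splits below will not reproduce the exact folds in the figure), one can list the training sets of the surrogate models that leave a fixed case out and count how many cases consecutive training sets exchange; scikit-learn's RepeatedKFold is used purely for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

n, k, b = 9, 3, 3                 # 9 cases, 3 folds, 3 repetitions, as in the figure
rkf = RepeatedKFold(n_splits=k, n_repeats=b, random_state=1)

case = 2                          # follow one case across the repetitions
train_sets = [set(train) for train, test in rkf.split(np.zeros((n, 1))) if case in test]

for a, c in zip(train_sets, train_sets[1:]):
    # each training set has n - n/k = 6 cases; consecutive ones differ
    # by at most n/k - 1 = 2 exchanged cases
    print(f"exchanged {len(a - c)} training case(s)")
```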

How much the model reacts to slightly different training sets (produced by exchanging a few training cases) is one possible way of defining model stability.

(I tend to think of stability (or ruggedness in analytical chemistry) not as an absolute characteristic, but as stability against particular influencing factors or perturbations.)
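
If you want a single number per case, one illustrative choice (by no means the only one) is the fraction of a case's b predictions that disagree with that case's majority prediction; the toy prediction matrix below is made up for the example.

```python
import numpy as np

# toy prediction matrix: rows = b runs, columns = cases (predicted class labels)
preds = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [0, 1, 1, 0]])

def per_case_instability(preds):
    """For each case, the fraction of its b predictions that disagree
    with that case's majority prediction (0 = perfectly stable)."""
    b, n = preds.shape
    return np.array([1.0 - np.bincount(preds[:, j]).max() / b for j in range(n)])

print(per_case_instability(preds))   # [0. 0. 0.333 0.] -> only the third case is unstable
```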

Update to answer comment: I do keep track of all bn predictions, yes (at least until I'm done analyzing the model in question). But then, I'm usually in a low sample size situation. That is, I may have many data rows, but there's structure in the data causing dependence (e.g. repeated measurements or the like).
Anyway, there are typically some easy cases that even unstable models will always get right - and more difficult cases where instability shows.
You need to evaluate a sufficient number of cases to be representative. If only a fraction of your data were needed for this, you wouldn't have needed to do cross validation in the first place; e.g. training a couple of surrogate models and testing them with a fixed test set would then have been sufficient. Even more so, since estimating a mean (like the generalization error) typically needs fewer samples than estimating variances.

If your model/training is stable, b doesn't need to be large. You can start with a few runs, and if everything is nice and stable, you're done. If not, you may want to do more runs, because repetitions help the generalization error estimate iff training is unstable. OTOH, in that situation you may consider going back one step and changing the training procedure so that you can obtain stable models with the data you have.
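
A hedged sketch of that "start with a few runs, add more if needed" idea: repeat k-fold cross validation with fresh splits until the run-level error estimates settle. The convergence tolerance, the maximum number of runs, the data set and the classifier are all arbitrary choices made for illustration, not prescriptions from the answer above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
k, max_runs, tol = 5, 20, 0.005    # arbitrary stopping parameters for illustration

run_means = []
for run in range(max_runs):
    kf = KFold(n_splits=k, shuffle=True, random_state=run)   # a fresh split each run
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
    run_means.append(scores.mean())
    # stop once the last few run-level estimates barely differ
    if run >= 2 and np.std(run_means[-3:]) < tol:
        break

print(f"stopped after {len(run_means)} runs, accuracy approx. {np.mean(run_means):.3f}")
```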