This answer follows up on my answer in Bias and variance in leave-one-out vs K-fold cross validation, which discusses why LOOCV does not always lead to higher variance. Following a similar approach, I will attempt to highlight a case where LOOCV does lead to higher variance in the presence of outliers and an "unstable" model.
Algorithmic stability (learning theory)
The topic of algorithmic stability is a recent one, and several classic, influential results have been proven in the past 20 years. Here are a few papers which are often cited:
The best page to gain an understanding is certainly the Wikipedia page, which provides an excellent summary written by a presumably very knowledgeable user.
Intuitive definition of stability
Intuitively, a stable algorithm is one for which the prediction does not change much when the training data is modified slightly.
Formally, there are half a dozen versions of stability, linked together by technical conditions and hierarchies, see this graphic from here for example:
The objective, however, is simple: we want tight bounds on the generalization error of a specific learning algorithm when the algorithm satisfies the stability criterion. As one would expect, the more restrictive the stability criterion, the tighter the corresponding bound.
Notation
The following notation is from the Wikipedia article, which itself copies the Bousquet and Elisseeff paper:
- The training set $S = \{ z_1 = (x_1,y_1), ..., z_m = (x_m, y_m)\}$ is drawn i.i.d. from an unknown distribution $D$
- The loss function $V$ of a hypothesis $f$ with respect to an example $z$ is defined as $V(f,z)$
- We modify the training set by removing the $i$-th element: $S^{|i} = \{ z_1,...,z_{i-1}, z_{i+1},...,z_m\}$
- Or by replacing the $i$-th element: $S^{i} = \{ z_1,...,z_{i-1}, z_i^{'}, z_{i+1},...,z_m\}$
Formal definitions
Perhaps the strongest notion of stability that an interesting learning algorithm might be expected to obey is that of uniform stability:
Uniform stability
An algorithm has uniform stability $\beta$ with respect to the loss function $V$ if the following holds:
$$\forall S \in Z^m, \ \ \forall i \in \{ 1,...,m\}, \ \ \sup_{z \in Z} \left| \, V(f_S,z) - V(f_{S^{|i}},z) \, \right| \ \leq \ \beta$$
Considered as a function of $m$, the term $\beta$ can be written as $\beta_m$. We say the algorithm is stable when $\beta_m$ decreases as $\frac{1}{m}$. A slightly weaker form of stability is:
Hypothesis stability
$$\forall i \in \{ 1,...,m\}, \ \ \mathbb{E}_{S,z}\left[ \, \left| V(f_S,z) - V(f_{S^{|i}},z) \right| \, \right] \ \leq \ \beta$$
If one point is removed, the difference in the outcome of the learning algorithm is measured by the averaged absolute difference of the losses ($L_1$ norm). Intuitively: small changes in the sample can only cause the algorithm to move to nearby hypotheses.
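The expectation above can be approximated by Monte Carlo for a concrete learner. Below is a minimal sketch that estimates the hypothesis-stability quantity for closed-form ridge regression on a toy linear problem (the data model, dimensions, and $\lambda = 1$ are my assumptions, not from any of the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def sq_loss(w, x, y):
    """Squared loss V(f, z) of the linear hypothesis w at point z = (x, y)."""
    return (x @ w - y) ** 2

m, d, trials = 50, 3, 200
w_true = np.array([1.0, -2.0, 0.5])   # assumed toy data-generating weights
diffs = []
for _ in range(trials):
    # Draw a training set S and a fresh test point z from the same distribution
    X = rng.normal(size=(m, d))
    y = X @ w_true + rng.normal(scale=0.1, size=m)
    x_test = rng.normal(size=d)
    y_test = x_test @ w_true

    w_full = ridge_fit(X, y)
    i = rng.integers(m)   # remove the i-th example, giving S^{|i}
    w_loo = ridge_fit(np.delete(X, i, axis=0), np.delete(y, i))
    diffs.append(abs(sq_loss(w_full, x_test, y_test) - sq_loss(w_loo, x_test, y_test)))

beta_hat = np.mean(diffs)   # Monte Carlo estimate of E[ |V(f_S,z) - V(f_{S^|i},z)| ]
print(beta_hat)
```

For a regularized, well-behaved learner like this one, the estimated $\beta$ comes out very small, consistent with the intuition that removing one of $m$ points moves the hypothesis only slightly.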
The advantage of these forms of stability is that they provide bounds for the bias and variance of stable algorithms. In particular, Bousquet and Elisseeff proved these bounds for uniform and hypothesis stability in 2002. Since then, much work has been done to relax the stability conditions and generalize the bounds; for example, in 2011, Kale, Kumar, and Vassilvitskii argue that mean-square stability provides better quantitative variance-reduction bounds.
Some examples of stable algorithms
The following algorithms have been shown to be stable and have proven generalization bounds:
- Regularized least square regression (with appropriate prior)
- KNN classifier with 0-1 loss function
- SVM with a bounded kernel and large regularization constant
- Soft margin SVM
- Minimum relative entropy algorithm for classification
- A version of bagging regularizers
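Several entries in the list above owe their stability to regularization. The sketch below illustrates this effect for regularized least squares: replacing one training point with an outlier perturbs the fitted predictions much less when the regularization constant is large (the data, the injected outlier, and the $\lambda$ grid are all illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 40, 2
X = rng.normal(size=(m, d))
y = X @ np.array([1.5, -1.0]) + rng.normal(scale=0.2, size=m)
X_test = rng.normal(size=(500, d))   # grid of test points to probe predictions

def ridge_fit(X, y, lam):
    """Closed-form regularized least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Replace one training point with an extreme outlier (this is S^i with z_i'),
# then measure the worst change in test predictions as regularization grows.
X_mod, y_mod = X.copy(), y.copy()
X_mod[0], y_mod[0] = np.array([3.0, -3.0]), 25.0   # injected outlier

perturbations = []
for lam in (0.01, 1.0, 1000.0):
    w = ridge_fit(X, y, lam)
    w_mod = ridge_fit(X_mod, y_mod, lam)
    perturbations.append(np.max(np.abs(X_test @ (w - w_mod))))
print(perturbations)   # shrinks as lam grows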
An experimental simulation
Repeating the experiment from the previous thread (see here), we now introduce a certain ratio of outliers in the data set. In particular:
- 97% of the data has $[-.5,.5]$ uniform noise
- 3% of the data with $[-20,20]$ uniform noise
As the third-order polynomial model is not regularized, it will be heavily influenced by the presence of a few outliers for small data sets. For larger datasets, or when there are more outliers, their effect is smaller as they tend to cancel out. See below for two models fit on 60 and 200 data points.
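A minimal sketch of this data-generating setup is below. The 97%/3% split and the two noise ranges come from the description above; the cubic true signal $x^3 - x$ and the input range $[-2, 2]$ are my own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n, outlier_frac=0.03):
    """Toy regression data: smooth cubic signal, mostly-small uniform noise,
    plus a small fraction of extreme [-20, 20] outliers."""
    x = rng.uniform(-2, 2, size=n)
    noise = rng.uniform(-0.5, 0.5, size=n)        # 97% of points
    mask = rng.random(n) < outlier_frac
    noise[mask] = rng.uniform(-20, 20, size=mask.sum())   # 3% outliers
    return x, x ** 3 - x + noise

# Unregularized cubic fit: heavily influenced by outliers at small n
fits = {}
for n in (60, 200):
    x, y = make_data(n)
    fits[n] = np.polyfit(x, y, deg=3)
    print(n, fits[n])
```

Running this a few times with different seeds shows the fitted coefficients jumping around much more at $n = 60$ than at $n = 200$, matching the two plotted models.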
Performing the simulation as previously and plotting the resulting average MSE and variance of the MSE gives results very similar to Experiment 2 of the Bengio & Grandvalet 2004 paper.
Left Hand Side: no outliers. Right Hand Side: 3% outliers.
(see the linked paper for explanation of the last figure)
Explanations
Quoting Yves Grandvalet's answer on the other thread:
Intuitively, [in the situation of unstable algorithms], leave-one-out CV may be blind to instabilities that exist, but may not be triggered by changing a single point in the training data, which makes it highly variable to the realization of the training set.
In practice it is quite difficult to simulate an increase in variance due to LOOCV. It requires a particular combination of instability, some outliers but not too many, and a large number of iterations. Perhaps this is expected, since linear regression has been shown to be quite stable. An interesting experiment would be to repeat this for higher-dimensional data and a more unstable algorithm (e.g. a decision tree).
Best Answer
The argument the paper seems to be making strikes me as strange.
According to the paper, the goal of CV is to estimate $\alpha_2$, the expected predictive performance of the model on new data, given that the model was trained on the observed dataset $S$. When we conduct $k$-fold CV, we obtain an estimate $\hat A$ of this number. Because of the random partitioning of $S$ into $k$ folds, this is a random variable $\hat A \sim f(A)$ with mean $\mu_k$ and variance $\sigma^2_k$. In contrast, $n$-times-repeated CV yields an estimate with the same mean $\mu_k$ but smaller variance $\sigma^2_k/n$.
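The variance reduction from repeating CV can be shown in a few lines. The sketch below fixes one dataset $S$ (so the only remaining randomness is the fold partition), draws many single $k$-fold estimates $\hat A$, and compares their spread to that of 10-times-repeated averages (the OLS toy model and all sizes are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed dataset S: the only randomness left is the random fold partition.
N = 60
X = rng.normal(size=(N, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=1.0, size=N)

def kfold_mse(seed, k=10):
    """One k-fold CV estimate of test MSE for OLS on the fixed dataset S."""
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((X[f] @ w - y[f]) ** 2))
    return np.mean(errs)

singles = np.array([kfold_mse(s) for s in range(300)])
# 30 estimates, each an average over n = 10 independent repetitions
repeated = singles.reshape(30, 10).mean(axis=1)

print(singles.var(), repeated.var())   # repeated variance is roughly var/10
```

Both collections have the same mean $\mu_k$; only the spread around it differs, which is exactly the $\sigma^2_k$ versus $\sigma^2_k/n$ claim above.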
Obviously, $\alpha_2\ne \mu_k$. This bias is something we have to accept.
However, the expected error $\mathbb E\big[|\alpha_2-\hat A|^2\big]$ will be larger for smaller $n$, and will be the largest for $n=1$, at least under reasonable assumptions about $f(A)$, e.g. when $\hat A\mathrel{\dot\sim} \mathcal N(\mu_k,\sigma^2_k/n)$. In other words, repeated CV lets us obtain a more precise estimate of $\mu_k$, and this is a good thing because it gives a more precise estimate of $\alpha_2$.
Therefore, repeated CV is strictly more precise than non-repeated CV.
The authors do not argue with that! Instead they claim, based on the simulations, that
This just means that $\sigma^2_k$ in their simulations was pretty low; and indeed, the lowest sample size they used was $200$, which is probably big enough to yield small $\sigma^2_k$. (The difference in estimates obtained with non-repeated CV and 30-times-repeated CV is always small.) With smaller sample sizes one can expect larger between-repetitions variance.
CAVEAT: Confidence intervals!
Another point that the authors are making is that
It seems that they are referring to confidence intervals for the mean across CV repetitions. I fully agree that this is a meaningless thing to report! The more times CV is repeated, the smaller this CI will be, but nobody is interested in the CI around our estimate of $\mu_k$! We care about the CI around our estimate of $\alpha_2$.
The authors also report CIs for the non-repeated CV, and it's not entirely clear to me how these CIs were constructed. I guess these are the CIs for the means across the $k$ folds. I would argue that these CIs are also pretty much meaningless!
Take a look at one of their examples: the accuracy for the adult dataset with the NB algorithm and a sample size of 200. They get 78.0% with non-repeated CV, CI (72.26, 83.74); 79.0% (77.21, 80.79) with 10-times-repeated CV; and 79.1% (78.07, 80.13) with 30-times-repeated CV. All of these CIs are useless, including the first one. The best estimate of $\mu_k$ is 79.1%. This corresponds to 158 successes out of 200. This yields a 95% binomial confidence interval of (72.8, 84.5) -- broader even than the first one reported. If I wanted to report some CI, this is the one I would report.

MORE GENERAL CAVEAT: variance of CV.
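For reference, a binomial interval like the one above can be computed with the exact Clopper-Pearson construction (one of several standard binomial CIs; which one the original numbers used is my assumption):

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) binomial confidence interval
    for k successes in n trials, via beta quantiles."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# 158 successes out of 200, as in the example above
lo, hi = clopper_pearson(158, 200)
print(round(100 * lo, 1), round(100 * hi, 1))
```

This reproduces an interval of roughly (72.7, 84.4) -- close to the (72.8, 84.5) quoted above, with small differences attributable to the choice of binomial CI method.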
You wrote that repeated CV
One should be very clear about what one means by the "variance" of CV. Repeated CV reduces the variance of the estimate of $\mu_k$. Note that in the case of leave-one-out CV (LOOCV), when $k=N$, this variance is equal to zero. Nevertheless, it is often said that LOOCV actually has the highest variance of all possible $k$-fold CVs. See e.g. here: Variance and bias in cross-validation: why does leave-one-out CV have higher variance?
Why is that? It is because LOOCV has the highest variance as an estimate of $\alpha_1$, the expected predictive performance of the model on new data when built on a new dataset of the same size as $S$. This is a completely different issue.