Solved – Reporting the variance of repeated k-fold cross-validation


I have been using repeated k-fold cross-validation and have been reporting the mean of the evaluation metric (e.g., sensitivity, specificity), computed as the grand mean across the folds of the different runs of the cross-validation.
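Concretely (the notation here is just mine), with $R$ repetitions, $K$ folds, equal-sized folds, and $A_{r,k}$ denoting the metric on fold $k$ of repetition $r$, the grand mean I report is

$$\bar{A} = \frac{1}{RK}\sum_{r=1}^{R}\sum_{k=1}^{K} A_{r,k}.$$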

However, I am not sure how I should report the variance. I have found many questions here discussing repeated cross-validation, but none that I am aware of explicitly answers the question of how to report the variance of a repeated cross-validation test.

I understand that the total variance is due to: 1) the instability of the model and 2) the limited sample size.

It seems that there are four different approaches to computing the variance for repeated k-fold cross-validation (a small code sketch illustrating all four follows the list):

1) the variance of the estimated average performance metric (e.g., accuracy) across the runs of the cross-validation.

2) the pooled variance, obtained by pooling the run-specific variances (each computed across the different folds of a single cross-validation run).

3) concatenating the classification results from the different folds of a cross-validation run into one large vector. For instance, if the number of test cases in each fold is 10 and I use 10-fold CV, the resulting vector for a repetition is of size 100.
If I repeat my cross-validation test 10 times, I then have 10 vectors of size 100, each containing the classification results of one 10-fold CV run.
I would then compute the mean and variance as in the case of a single CV run.

4) I have also read (equations 2 and 3 in [1]) that the variance is the sum of the external variance and the expected internal variance. If I understand correctly, the external variance is the variance of the repetition-specific average performances, and the internal variance is the variance across the different folds of a single cross-validation run.
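To make these four options concrete, here is a minimal Python sketch (my own illustrative code, not taken from any reference) of how each quantity could be computed from per-sample correctness indicators of a repeated 10-fold CV; the simulated `correct` array and all names are placeholders for my actual CV output:

import numpy as np

rng = np.random.default_rng(0)
R, K, n_per_fold = 10, 10, 10        # 10 repetitions of 10-fold CV, 10 test cases per fold
N = K * n_per_fold

# Simulated 0/1 correctness indicators, shape (R, N); replace with real CV results.
correct = rng.binomial(1, 0.8, size=(R, N)).astype(float)

# Fold-level accuracies, shape (R, K), and per-repetition means
fold_acc = correct.reshape(R, K, n_per_fold).mean(axis=2)
run_means = fold_acc.mean(axis=1)
grand_mean = run_means.mean()

# 1) variance of the per-run average accuracies across repetitions
var_1 = run_means.var(ddof=1)

# 2) pooled variance: average of the run-specific variances computed across folds
var_2 = fold_acc.var(axis=1, ddof=1).mean()

# 3) per repetition, variance of the concatenated per-sample vector of size N,
#    computed as in a single CV run (here averaged over the repetitions)
var_3 = correct.var(axis=1, ddof=1).mean()

# 4) "external variance + expected internal variance" as I described it above
var_4 = var_1 + var_2

print(grand_mean, var_1, var_2, var_3, var_4)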

I would greatly appreciate your help and guidance on which variance would be the appropriate one to report for the repeated cross-validation test.

Thanks,

Best Answer

1 and 3 seem invalid to me, since they do not take into account the dependencies between the repeated runs. In other words, repeated k-fold runs are more similar to each other than real repetitions of the experiment with independent data would be (the toy simulation after these points illustrates this).

2 does not take into account the dependencies between folds within the same run.

I do not know about 4.
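To illustrate the point about dependence, here is a toy simulation sketch (my own setup with scikit-learn; the dataset and model are arbitrary): the spread of repeated-CV means computed on one dataset is typically much smaller than the spread of CV means across genuinely independent datasets of the same size, which is what makes estimate 1 optimistic.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# A large pool playing the role of the population
X_pool, y_pool = make_classification(n_samples=20000, n_features=20,
                                     n_informative=5, random_state=0)

def cv_mean(X, y, seed):
    # Mean accuracy of one 10-fold CV run with a given shuffling seed
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv).mean()

def draw_dataset(n=200):
    # A fresh sample of size n from the pool, i.e. a "new" dataset
    idx = rng.choice(len(y_pool), size=n, replace=False)
    return X_pool[idx], y_pool[idx]

# (a) repeated 10-fold CV on a single dataset: the runs reuse the same data
X, y = draw_dataset()
repeated = [cv_mean(X, y, seed) for seed in range(10)]

# (b) one 10-fold CV run on each of 10 independent datasets: a real repetition
independent = [cv_mean(*draw_dataset(), seed) for seed in range(10)]

print("SD of CV means, repeated runs on one dataset:", np.std(repeated, ddof=1))
print("SD of CV means, independent datasets:        ", np.std(independent, ddof=1))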

A potentially relevant (and discouraging) reference is Bengio & Grandvalet, 2004, "No Unbiased Estimator of the Variance of K-Fold Cross-Validation"