Solved – How to report a confusion matrix for repeated K-fold cross-validation

accuracy, caret, classification, cross-validation, machine learning

I am trying to construct confusion matrices in R with the caret package for repeated K-fold cross-validation, specifically 10-fold cross-validation with 10 repeats.

I realized there was already a similar question asked on this topic here:
How is the confusion matrix reported from K-fold cross-validation?

However, the answer to that question does not really address repeated cross-validation. The solution I have come up with is to average each observation's predicted probabilities over the repetitions: in each repetition, every data point is predicted exactly once, so the final prediction for a data point is obtained from the average of its predicted probabilities across all 10 repetitions.
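For concreteness, here is a minimal sketch of this averaging approach with caret, assuming a binary outcome `y` (a factor with levels `"neg"` and `"pos"`) and a predictor data frame `X`; the model choice (`glm`) and all object names are illustrative rather than taken from my actual analysis.

```r
library(caret)

## 10-fold CV with 10 repeats, keeping the held-out class probabilities
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     classProbs = TRUE, savePredictions = "final")
fit <- train(x = X, y = y, method = "glm", trControl = ctrl)

## fit$pred contains one held-out prediction per observation per repetition;
## average the positive-class probability over the 10 repetitions ...
avg_prob <- aggregate(pos ~ rowIndex, data = fit$pred, FUN = mean)

## ... then threshold at 1/2 and build a single confusion matrix
final_lab <- factor(ifelse(avg_prob$pos > 0.5, "pos", "neg"), levels = levels(y))
confusionMatrix(final_lab, y[avg_prob$rowIndex], positive = "pos")
```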

Does this make logical sense? If it does, how should a confidence interval for the accuracy be constructed? Furthermore, how should model performances be compared on this basis?

I also just realized a potential concern with the method described above.

Suppose we have a total of $N$ observations and $n$ repetitions of K-fold cross-validation. We also assume we are working with binary classification where the positive case is denoted by 1 and the negative case is denoted by 0. Multilabel classification can be done in an analogous fashion. Let $y_i$ denote the observed label of the $i$-th observation, $i = 1,…, N$, and $\hat{p}_{ij}$ be the predicted probability of the $i$-th observation being the positive case for the $j$-th repetition, $j = 1,…,n$. The final predicted label of the $i$-th observation is then
$$
\displaystyle \hat{y}_i = \mathbf{1}\{n^{-1}\sum_{j=1}^{n}\hat{p}_{ij}>1/2\},
$$
i.e. the final label is obtained by thresholding the average of the observation's predicted probabilities across all repetitions.

Consequently, the overall estimated accuracy (from the confusion matrix) will then be
$$
\displaystyle \text{Acc} = N^{-1}\sum_{i = 1}^{N} \mathbf{1}\{\hat{y}_i = y_i\}.
$$
This form of "ensembling" is for ease of interpretation and for constructing a sensible confusion matrix. However, there is a drawback to this approach: we would also like an estimate of the variance of the overall accuracy, and "collapsing" the repetitions into one final label per observation makes it very difficult to obtain one. To estimate the variance, we instead need the overall accuracy of each repetition, i.e. for the $j$-th repetition
$$
\displaystyle \text{Acc}_j = N^{-1}\sum_{i = 1}^{N}\mathbf{1}\{\mathbf{1}\{\hat{p}_{ij}>1/2\} = y_i\},
$$
and the estimated variance will then be
$$
\displaystyle \text{Var} = \sum_{j = 1}^{n}(\text{Acc}_j-\overline{\text{Acc}})^2/(n-1),
$$
where
$$
\displaystyle \overline{\text{Acc}} = n^{-1}\sum_{j = 1}^{n}\text{Acc}_j.
$$
Clearly,
$$
\displaystyle \text{Acc} \neq \overline{\text{Acc}}.
$$
However, we would hope that these two quantities are relatively close. Ideally, one would want to show mathematically that both quantities converge to the true accuracy in some limiting regime. Can this be proven in any way?
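To make the comparison concrete, here is a minimal sketch of these quantities in R, assuming a numeric 0/1 label vector `y` of length $N$ and an $N \times n$ matrix `p_hat` whose $(i, j)$ entry is $\hat{p}_{ij}$ (both names are illustrative).

```r
## per-repetition labels and accuracies Acc_j
y_hat_rep <- (p_hat > 0.5) * 1             # N x n matrix of 0/1 predictions
acc_rep   <- colMeans(y_hat_rep == y)      # Acc_j, j = 1, ..., n
acc_bar   <- mean(acc_rep)                 # mean of the per-repetition accuracies
var_acc   <- var(acc_rep)                  # sample variance with divisor n - 1

## "ensembled" accuracy from the averaged probabilities, for comparison
y_hat_avg <- (rowMeans(p_hat) > 0.5) * 1
acc_avg   <- mean(y_hat_avg == y)

c(Acc = acc_avg, Acc_bar = acc_bar, Var = var_acc)
```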

Best Answer

The way I usually do this is to stack the test folds (so you get one test prediction for every observation in the dataset) and to ensemble the train-fold predictions. If your classifier produces probabilities, this is often done by averaging the probabilities and then thresholding to get the labels. Otherwise, with a binary classifier, it’s common to use a majority rule (with 10-fold cross-validation you have 9 train predictions per observation, so majority voting cannot produce ties).
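A rough sketch of what I mean, for a single run of 10-fold cross-validation with a probabilistic classifier; the logistic regression and all object names are just placeholders, assuming a binary factor outcome `y` with levels `"neg"`/`"pos"` and a predictor data frame `X`:

```r
library(caret)

set.seed(1)
folds <- createFolds(y, k = 10)                      # test-fold indices

test_prob  <- rep(NA_real_, length(y))               # stacked test predictions
train_prob <- matrix(NA_real_, nrow = length(y), ncol = length(folds))

for (k in seq_along(folds)) {
  test_idx  <- folds[[k]]
  train_idx <- setdiff(seq_along(y), test_idx)
  dat       <- data.frame(X, y = y)

  fit <- glm(y ~ ., data = dat[train_idx, ], family = binomial)

  ## each observation gets exactly one test prediction over the 10 folds ...
  test_prob[test_idx] <- predict(fit, newdata = dat[test_idx, ], type = "response")
  ## ... and one train prediction from each of the 9 folds it helps fit
  train_prob[train_idx, k] <- predict(fit, newdata = dat[train_idx, ], type = "response")
}

## stacked test labels, and train labels ensembled by averaging probabilities
## (a majority vote over the 9 predicted labels would work the same way)
test_lab  <- factor(ifelse(test_prob > 0.5, "pos", "neg"), levels = levels(y))
train_lab <- factor(ifelse(rowMeans(train_prob, na.rm = TRUE) > 0.5, "pos", "neg"),
                    levels = levels(y))

confusionMatrix(test_lab, y, positive = "pos")       # test confusion matrix
confusionMatrix(train_lab, y, positive = "pos")      # ensembled train confusion matrix
```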

You compare models by their test-set evaluation metric, and you compare the train and test metrics to guard against overfitting.

Confidence intervals for prediction performance are commonly obtained through bootstrapping, but you can also construct them when cross-validating. See for example https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/cv_boot.pdf
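For instance, a simple percentile bootstrap of the accuracy over the stacked out-of-fold predictions could look like this, reusing the illustrative `test_lab` and `y` from the sketch above:

```r
set.seed(2)
B <- 2000                                    # number of bootstrap resamples
acc_boot <- replicate(B, {
  idx <- sample(length(y), replace = TRUE)   # resample observations with replacement
  mean(test_lab[idx] == y[idx])              # accuracy on the bootstrap sample
})
quantile(acc_boot, c(0.025, 0.975))          # 95% percentile interval for the accuracy
```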
