Appropriate way to get cross-validated performance metrics

Tags: auc, calibration, cross-validation, logistic, predictive-models

For cross-validation of a logistic regression classifier, it seems to me that there are several different approaches to calculating each performance metric (a short sketch contrasting them follows the list below):

  1. The performance metric is calculated separately in each left out fold, repeated across k folds, then averaged.
  2. The probabilities are calculated in each left out fold, repeated across k folds, concatenated into a long vector and a single performance metric calculated.
  3. Same as 2 but instead of concatenating the probabilities, store the indexes and average the predicted probabilities for each outcome instance. Calculate a single performance metric from the averaged probabilities.

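To make the distinction concrete, here is a minimal sketch of approaches 1 and 2 for the AUC using scikit-learn; the data, model, and fold settings are placeholders, and approach 3 would additionally require repeating the whole procedure and averaging the stored probabilities per observation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_aucs = []                    # approach 1: one AUC per left-out fold
oof_probs = np.empty(len(y))      # approach 2: pooled out-of-fold probabilities

for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    p = model.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], p))
    oof_probs[test_idx] = p

auc_approach_1 = np.mean(fold_aucs)            # average of the per-fold AUCs
auc_approach_2 = roc_auc_score(y, oof_probs)   # one AUC on the concatenated predictions
```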
This question has been asked before with respect to AUCs here. That question considered the first two options above, and there was no consensus on the correct method: the accepted answer suggested the first method, but the most upvoted answer suggested the second and cited the paper "Apples-to-Apples in Cross-Validation Studies: Pitfalls in Classifier Performance Measurement" in support.

Another question asks the same thing, and the answer there suggests there is merit in both the first and second approaches.

This question is also relevant.

The issue is also addressed with reference to AUCs in Statistical Evaluation of Diagnostic Performance: Topics in ROC Analysis (p. 204):

"the machine learning community often uses other strategies to calculate the cross-validated AUC. For example, Bradley pointed out that some averaged AUCs from ROC curves correspond to each partition and others aggregated the outputs of all folds first, producing one ROC and calculating its AUC"

Altogether there does not appear to be a consensus.

Another motivation for asking again is that I cannot find any reference that considers how to calculate performance metrics other than the AUC under cross-validation, such as the calibration slope (beta) or the net benefit from decision curve analysis. Does anyone have any advice?
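To make that part of the question concrete, here is the kind of thing I have in mind for the calibration slope, computed from the pooled out-of-fold probabilities of approach 2 (a sketch only, using statsmodels; whether pooling, per-fold averaging, or something else is appropriate is exactly what I am asking):

```python
import numpy as np
import statsmodels.api as sm

def calibration_slope(y_true, p_oof, eps=1e-12):
    """Slope from regressing the outcome on the logit of the out-of-fold
    probabilities; a slope of 1 indicates ideal calibration."""
    p = np.clip(p_oof, eps, 1 - eps)
    logit_p = np.log(p / (1 - p))
    fit = sm.GLM(y_true, sm.add_constant(logit_p),
                 family=sm.families.Binomial()).fit()
    return fit.params[1]   # index 0 is the intercept added by add_constant
```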

Best Answer

I think that if the folds are of equal size, then methods 1 and 2 will give the same mean value (or very similar values if the folds are only approximately the same size).
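For a metric that is a simple average of per-instance losses, this can be seen directly: with $k$ folds $F_1,\dots,F_k$ each containing $m$ observations and per-instance loss $\ell_i$,

$$\frac{1}{k}\sum_{j=1}^{k}\left(\frac{1}{m}\sum_{i\in F_j}\ell_i\right)=\frac{1}{km}\sum_{i=1}^{km}\ell_i ,$$

so averaging the per-fold values (method 1) and computing the metric once on the pooled predictions (method 2) coincide; with unequal fold sizes they differ only in how the folds are weighted.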

Personally I would tend to use the first method, because I then also have a sensible estimate of the variance of the performance metric, which means I can assess whether a difference in performance between two models is at least statistically significant.

To expand a bit more on that, say you have two models $M_1$ and $M_2$; you can use cross-validation to estimate the performance of each and pick the better one. However, if you split the data into folds in a different way (a different random seed) then the performance estimates are likely to be different, and could even be different enough for the ranking of the two models to be reversed.

So we would like some way to see whether the difference in performance is large compared to the variation caused by the random partitioning of the data into folds. We can do that with method 1, as we have a performance estimate from each fold, and those estimates differ only in the partitioning of the data, so their variance measures that variability. With method 2 we have only one performance estimate, so we have no corresponding estimate of the uncertainty due to the sampling of the data to form folds. This is an excellent reason to use method 1 rather than method 2.
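A minimal sketch of that comparison, with placeholder data and two arbitrary logistic regression variants standing in for $M_1$ and $M_2$; the pairing relies on both models being scored on the same folds, and the caveat in the next paragraph about dependence between folds applies to the resulting p-value:

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_informative=5, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # fixed seed: same folds for both models

scores_m1 = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                            cv=cv, scoring="roc_auc")
scores_m2 = cross_val_score(LogisticRegression(max_iter=1000, C=0.01), X, y,
                            cv=cv, scoring="roc_auc")

t_stat, p_value = ttest_rel(scores_m1, scores_m2)  # paired t-test over the k per-fold AUCs
```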

Note that the variance discussed above is not the variance we would see if we had a completely new training set and a new test set. The variance from cross-validation is likely to be optimistically narrow, because the training sets in the different folds are not independent but share examples (and the test data in some folds appear in the training sets of others, so those are not completely independent either). This lack of independence has an impact on estimating the uncertainty of the performance estimate; see Nadeau and Bengio (2003):

Nadeau, C., Bengio, Y. Inference for the Generalization Error. Machine Learning 52, 239–281 (2003). https://doi.org/10.1023/A:1024068626366
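As I understand it, the practical recipe from that paper is the corrected resampled t-test, which inflates the naive variance of the per-split performance differences to allow for the overlap between training sets. A sketch of the correction for $J$ train/test splits with $n_1$ training and $n_2$ test observations:

```python
import numpy as np
from scipy.stats import t as t_dist

def corrected_resampled_ttest(d, n_train, n_test):
    """d: per-split differences in a performance metric between two models.
    Returns the Nadeau-Bengio corrected t statistic and two-sided p-value."""
    d = np.asarray(d, dtype=float)
    J = len(d)
    var = np.var(d, ddof=1)
    t_stat = d.mean() / np.sqrt((1.0 / J + n_test / n_train) * var)
    p_value = 2 * t_dist.sf(abs(t_stat), df=J - 1)
    return t_stat, p_value
```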

Following the link to the related question, I can also comment on method 3. I think I agree with @FrankHarrell's answer there that it is not an appropriate method. The point of cross-validation is to estimate not the performance of a particular model, but the performance of a method for producing a model, so the performance must be measured over the test folds using models fitted by that method. If you average the outputs of models trained on different folds, then you are not evaluating the model you will deploy in operation, as the deployed model will not have that averaging step.

If you want to do something like that, I would opt for Bagging, where you are explicitly forming a committee or ensemble model (which has the averaging) and you can use the "out-of-bag" estimator to get a good performance estimate for the ensemble.
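For example, with scikit-learn (a sketch; the base model, data, and number of estimators are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)

# `estimator=` in recent scikit-learn versions (older releases use `base_estimator=`)
bag = BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                        n_estimators=200, oob_score=True, random_state=0)
bag.fit(X, y)

oob_accuracy = bag.oob_score_                                  # out-of-bag accuracy of the ensemble
oob_auc = roc_auc_score(y, bag.oob_decision_function_[:, 1])   # OOB probabilities -> AUC
```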

I have to say, I have never really liked repeated k-fold cross-validation. It seems to me that there is more symmetry in just having 100 random 90%/10% training/test splits than in 10 repetitions of 10-fold cross-validation. However, I don't know of a good statistical reason to prefer one over the other.
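For reference, the two schemes side by side as scikit-learn splitters (a sketch with placeholder data and model; both yield 100 test-set scores):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (RepeatedStratifiedKFold,
                                     StratifiedShuffleSplit, cross_val_score)

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

# 100 random 90%/10% training/test splits ("Monte Carlo" cross-validation)
mc_cv = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
# 10 repetitions of 10-fold cross-validation (also 100 test folds in total)
rep_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

scores_mc = cross_val_score(model, X, y, cv=mc_cv, scoring="roc_auc")
scores_rep = cross_val_score(model, X, y, cv=rep_cv, scoring="roc_auc")
```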