- F-measure: Should I sum the quantities (i.e., TP, FP, FN) over the N x K runs and compute F-measure using these sums?
Yes! Calculate one F1 score for each run of cross-validation and average over the N runs. This is also a great opportunity to see how this approach and calculating an F1 score for each fold and averaging over the folds differ from each other; a sketch comparing two of the aggregation strategies follows below.
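Whichever aggregation you pick, it only takes a few lines to compute the alternatives side by side. Below is a minimal sketch, assuming scikit-learn, with placeholder data, model, and fold counts (nothing here is taken from the question); it compares the F1 averaged over folds with the F1 computed from TP/FP/FN pooled over all the folds. Averaging per repeat rather than per fold would be structured the same way.

```python
# Sketch (placeholders only): compare F1 averaged over folds with F1 computed
# from TP/FP/FN pooled over all N x K folds of repeated k-fold CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)   # placeholder data
clf = LogisticRegression(max_iter=1000)                     # placeholder model

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)  # K=10, N=5

fold_f1 = []
tp = fp = fn = 0
for train_idx, test_idx in cv.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    fold_f1.append(f1_score(y[test_idx], pred))
    tn_i, fp_i, fn_i, tp_i = confusion_matrix(y[test_idx], pred).ravel()
    tp, fp, fn = tp + tp_i, fp + fp_i, fn + fn_i

print("mean of per-fold F1:", np.mean(fold_f1), "+/-", np.std(fold_f1, ddof=1))
print("F1 from pooled counts:", 2 * tp / (2 * tp + fp + fn))
```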
- For accuracy, should we sum the accuracy for each of the N x K runs and simply take their average for the overall estimate?
Also yes! Good approach. In applications, it is sometimes not about being 100% correct but about applying methods and techniques according to their ease of use.
However, whenever you are reporting the mean, please also report the variance or the standard deviation.
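For example, reporting the mean together with the standard deviation of the per-fold accuracy is only a couple of lines; here is a sketch reusing the placeholder `X`, `y`, and `clf` from the previous snippet.

```python
# Sketch: report both the mean and the standard deviation of per-fold accuracy,
# reusing the placeholder X, y and clf from the previous sketch.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f} over {len(scores)} folds")
```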
I think if the folds are of equal size, then both methods 1 and 2 will give the same mean value (or very similar if the folds are only of approximately the same size).
Personally I would tend to use the first method, because I then also have a sensible estimate of the variance of the performance metric, which means I can assess whether a difference in performance between two models is at least statistically significant.
To expand a bit more on that, say you have two models $M_1$ and $M_2$; you can then use cross-validation to estimate the performance of each model and pick the better one. However, if you split the data into folds in a different way (different random seed), the performance estimates are likely to be different, and could even be different enough for the ranking of the two models to be reversed. So we would like some method to see whether the difference in performance is large compared to the difference caused by the random partitioning of the data to form the folds. We can do that for method 1, as we have a performance estimate from each fold, and those estimates differ only in how the data were partitioned, so their variance measures exactly that variability. For the second method, we only have one performance estimate, so we have no corresponding estimate of the uncertainty due to the sampling of the data to form the folds. This is an excellent reason to prefer method 1 over method 2; the sketch below illustrates the paired comparison it makes possible.
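Here is a minimal sketch of that idea, again with placeholder data and stand-in models for $M_1$ and $M_2$, assuming scikit-learn and SciPy. Note that the naive paired t-test used here ignores the dependence between folds discussed in the next paragraph, so its p-value should be read as optimistic.

```python
# Sketch: method 1 gives a score per fold for each model on identical folds,
# so the models can be compared in a paired fashion. The naive paired t-test
# below ignores the dependence between folds, so its p-value is optimistic.
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)   # placeholder data
m1 = LogisticRegression(max_iter=1000)                      # stand-in for M_1
m2 = RandomForestClassifier(random_state=0)                 # stand-in for M_2

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
s1 = cross_val_score(m1, X, y, cv=cv, scoring="accuracy")
s2 = cross_val_score(m2, X, y, cv=cv, scoring="accuracy")

d = s1 - s2                                                  # paired per-fold differences
t_stat, p_value = ttest_rel(s1, s2)
print(f"mean difference {d.mean():+.3f} +/- {d.std(ddof=1):.3f}, naive paired t-test p = {p_value:.3f}")
```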
Note the variance discussed in the previous section is not the variance we would see if we had a completely new training set and new test set. The variance of the cross-validation is likely to be optimistically narrow because the training sets in each fold are not independent, but share some examples (and the test data in some folds are in the training sets of others, so that isn't completely independent either). This lack of independence has an impact on estimating the uncertainty of the performance estimate; see Nadeau and Bengio (2003).
Nadeau, C., Bengio, Y. Inference for the Generalization Error. Machine Learning 52, 239–281 (2003). https://doi.org/10.1023/A:1024068626366
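For reference, the correction they propose is usually summarised as the "corrected resampled t-test": with $J$ paired fold-level differences $d_j$ between two models, and $n_{\text{test}}$ test and $n_{\text{train}}$ training examples per fold, the test statistic is

$$t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{J} + \frac{n_{\text{test}}}{n_{\text{train}}}\right)\hat{\sigma}_d^2}}, \qquad \hat{\sigma}_d^2 = \frac{1}{J-1}\sum_{j=1}^{J}\left(d_j - \bar{d}\right)^2,$$

i.e. the usual $\hat{\sigma}_d^2/J$ variance term is inflated by $n_{\text{test}}/n_{\text{train}}$ to account for the overlapping training sets. This is my summary of that line of work rather than a quote, so do check the paper for the exact form.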
The link to the related question means I can comment on method 3. I think I agree with @FrankHarrell in his answer that this is not an appropriate method. The point of cross-validation is to estimate not the performance of a particular model, but the performance of a method for producing a model. So the performance must be measured over the test folds using a model fitted with that method. If you average the outputs of the models trained in the different folds, you are not evaluating the model that you will deploy in operation, as the deployed model will not have that averaging step.
If you want to do something like that, I would opt for Bagging, where you are explicitly forming a committee or ensemble model (which has the averaging) and you can use the "out-of-bag" estimator to get a good performance estimate for the ensemble.
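For instance, with scikit-learn the out-of-bag estimate comes almost for free; the following is only a sketch with placeholder data and a placeholder base learner.

```python
# Sketch: a bagged ensemble scored with its own out-of-bag predictions, i.e.
# each example is scored only by the members that did not see it in their
# bootstrap sample. Data and base learner are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),   # placeholder base learner
    n_estimators=200,
    oob_score=True,             # compute the out-of-bag accuracy while fitting
    random_state=0,
)
bag.fit(X, y)
print("out-of-bag accuracy estimate:", bag.oob_score_)
```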
I have to say, I have never really liked repeated k-fold cross-validation. It seems to me that there is more symmetry in just having 100 random 90%/10% training/test splits than in 10 lots of 10-fold cross-validation. However, I don't know of a good statistical reason to prefer one over the other.
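If you want to try both, they are easy to set up side by side; here is a sketch with placeholder data and model, assuming scikit-learn.

```python
# Sketch: the two resampling schemes mentioned above, each producing 100
# performance estimates. Placeholder data and model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=1000)

schemes = {
    "100 random 90/10 splits": ShuffleSplit(n_splits=100, test_size=0.1, random_state=0),
    "10 x 10-fold CV": RepeatedKFold(n_splits=10, n_repeats=10, random_state=0),
}
for name, cv in schemes.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f}")
```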
Best Answer
There is definitely a problem with selecting a case where the mean AUC is the best. You should instead report how you set up cross-validation, how many times you ran it, and include some summary statistics of the AUCs you obtained (maybe include a histogram, too).
Cross-validation gives you an estimate of how your model would perform if you trained it on one random sample from your distribution (of a size similar to your training folds) and evaluated it on another random sample from your distribution. The variability in AUCs you observe, depending on which examples make it into the training/test sets, shows that your model is somewhat sensitive to your sample. The variance of the AUCs gives you a sense of how sensitive it is.
To show why selecting the case with the best AUC is wrong, consider a situation where your model is extremely sensitive to the training/test split. It sounds like a bad model, right? But given the wide variance, on some sample it will work really, really well, purely by chance. You can then see how reporting just that figure would be really misleading.
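In code, reporting the whole distribution rather than the best fold might look like the following sketch, with placeholder data and model, assuming scikit-learn and matplotlib.

```python
# Sketch: summarise the spread of per-fold AUCs instead of picking the best.
# Placeholder data and model.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=1000)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

print(f"AUC over {len(aucs)} folds: mean {aucs.mean():.3f}, sd {aucs.std(ddof=1):.3f}, "
      f"min {aucs.min():.3f}, max {aucs.max():.3f}")
plt.hist(aucs, bins=20)          # the histogram suggested above
plt.xlabel("fold AUC")
plt.ylabel("count")
plt.show()
```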