Very interesting question, I'll have to read the papers you cite... But maybe this will start us in the direction of an answer:
I usually tackle this problem in a very pragmatic way: I iterate the k-fold cross validation with new random splits and calculate performance just as usual for each iteration. The overall test samples are then the same for each iteration, and the differences come from different splits of the data.
This I report e.g. as the 5th to 95th percentile of observed performance with respect to exchanging up to $\frac{n}{k} - 1$ samples for new samples, and discuss it as a measure of model instability.
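As a minimal sketch of how this looks in practice (hypothetical data and classifier; scikit-learn is assumed purely for illustration, any modelling workflow with an equivalent resampling loop works the same way):

```python
# Sketch: iterated k-fold CV with a new random split per iteration; report
# the 5th to 95th percentile of the observed performance over the iterations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=120, random_state=0)  # hypothetical data
k, n_iter = 5, 50

performance = []
for i in range(n_iter):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=i)  # new split each time
    pooled_pred = np.empty_like(y)
    for train, test in cv.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        pooled_pred[test] = model.predict(X[test])  # every sample tested exactly once
    performance.append(accuracy_score(y, pooled_pred))  # one value per iteration

print("5th to 95th percentile:", np.percentile(performance, [5, 95]))
```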
Side note: in any case, I cannot use formulas that need the sample size: as my data are clustered/hierarchical in structure (many similar but not repeated measurements of the same case, usually several [hundred] different locations of the same specimen), I don't know the effective sample size.
Comparison to bootstrapping:

- Iterations use new random splits.
- The main difference is resampling with (bootstrap) or without (cv) replacement.
- Computational cost is about the same, as I'd choose the number of cv iterations $\approx$ number of bootstrap iterations $/\, k$, i.e. calculate the same total number of models (a sketch follows below).
- Bootstrap has advantages over cv in terms of some statistical properties (it is asymptotically correct; possibly you need fewer iterations to obtain a good estimate).
- However, with cv you have the advantage that you are guaranteed that
  - the number of distinct training samples is the same for all models (important if you want to calculate learning curves), and
  - each sample is tested exactly once in each iteration.
- Some classification methods discard repeated samples, so bootstrapping does not make sense for them.
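A minimal sketch of the out-of-bootstrap counterpart under the same total model budget ($n_{bootstrap} = k \cdot n_{iter.~cv}$; hypothetical data and classifier again, scikit-learn assumed):

```python
# Sketch: out-of-bootstrap validation with the same total number of models
# as the repeated CV above (n_boot = k * n_iter).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=120, random_state=0)
rng = np.random.default_rng(0)
n = len(y)
n_boot = 5 * 50  # = k * n_iter of the CV sketch above

oob_acc = []
for _ in range(n_boot):
    train = rng.integers(0, n, size=n)       # resampling WITH replacement
    oob = np.setdiff1d(np.arange(n), train)  # left-out samples form the test set
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    oob_acc.append(accuracy_score(y[oob], model.predict(X[oob])))

# unlike cv, both the number of distinct training samples and the number of
# tested samples vary from model to model
print("mean out-of-bootstrap accuracy:", np.mean(oob_acc))
```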
Variance of the performance
Short answer: yes, it does make sense to speak of variance in situations where only $\{0, 1\}$ outcomes exist.
Have a look at the binomial distribution ($k$ = number of successes, $n$ = number of tests, $p$ = true probability of success = expected value of $\frac{k}{n}$):
$\sigma^2 (k) = np(1-p)$
The variance of proportions (such as hit rate, error rate, sensitivity, TPR, ...; I'll use $p$ from now on and $\hat p$ for the observed value in a test) is a topic that fills whole books...
- Fleiss: Statistical Methods for Rates and Proportions
- Forthofer and Lee: Biostatistics has a nice introduction.
Now, $\hat p = \frac{k}{n}$ and therefore:
$\sigma^2 (\hat p) = \frac{p (1-p)}{n}$
This means that the uncertainty in measuring classifier performance depends only on the true performance $p$ of the tested model and the number of test samples $n$.
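A quick worked example (numbers purely illustrative): with true performance $p = 0.9$ and $n = 100$ test samples, $\sigma (\hat p) = \sqrt{\frac{0.9 \cdot 0.1}{100}} = 0.03$, so observed performances roughly between 84 % and 96 % ($\hat p \pm 2 \sigma$) are entirely compatible with one and the same true performance.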
In cross validation you assume

1. that the $k$ "surrogate" models have the same true performance as the "real" model you usually build from all samples (the breakdown of this assumption is the well-known pessimistic bias), and
2. that the $k$ "surrogate" models have the same true performance as each other (are equivalent, have stable predictions), so you are allowed to pool the results of the $k$ tests.
Of course then not only the $k$ "surrogate" models of one iteration of cv can be pooled, but also the $k \cdot i$ models of $i$ iterations of $k$-fold cv.
Why iterate?
The main thing the iterations tell you is the model (prediction) instability, i.e. variance of the predictions of different models for the same sample.
You can report instability directly, e.g. as the variance of the prediction of a given test case (regardless of whether the prediction is correct), or a bit more indirectly as the variance of $\hat p$ between different cv iterations.
And yes, this is important information.
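A sketch of how to measure this per-sample instability (same hypothetical setup as above, scikit-learn assumed):

```python
# Sketch: record the prediction of every surrogate model for every sample,
# then look at the per-sample variance across CV iterations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=120, random_state=0)
k, n_iter = 5, 50
pred = np.empty((n_iter, len(y)), dtype=int)  # one prediction per sample and iteration

for i in range(n_iter):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=i)
    for train, test in cv.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        pred[i, test] = model.predict(X[test])

per_sample_var = pred.var(axis=0)          # instability of individual predictions
p_hat_per_iter = (pred == y).mean(axis=1)  # accuracy estimate of each iteration
print("samples with unstable predictions:", int((per_sample_var > 0).sum()))
print("variance of p-hat between iterations:", p_hat_per_iter.var(ddof=1))
```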
Now, if your models are perfectly stable, all $n_{bootstrap}$ or $k \cdot n_{iter.~cv}$ surrogate models would produce exactly the same prediction for a given sample. In other words, all iterations would have the same outcome, and the variance of the estimate would not be reduced by iterating (assuming $n - 1 \approx n$). In that case, assumption 2 from above is met and you are subject only to $\sigma^2 (\hat p) = \frac{p (1-p)}{n}$, with $n$ being the total number of samples tested in all $k$ folds of the cv.
In that case, iterations are not needed (other than for demonstrating stability).
You can then construct confidence intervals for the true performance $p$ from the observed number of successes $k$ in the $n$ tests. So, strictly speaking, there is no need to report the uncertainty if $\hat p$ and $n$ are reported. However, in my field not many people are aware of that, or even have an intuitive grip on how large the uncertainty is for a given sample size. So I'd recommend reporting it anyway.
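A minimal sketch of such a confidence interval (assuming statsmodels; the counts are made up):

```python
# Sketch: binomial (Wilson) confidence interval for the true performance p
# from k pooled successes in n tests.
from statsmodels.stats.proportion import proportion_confint

k_success, n_tests = 102, 120  # e.g. 102 correct out of 120 pooled test results
low, high = proportion_confint(k_success, n_tests, alpha=0.05, method="wilson")
print(f"p-hat = {k_success / n_tests:.3f}, 95 % CI [{low:.3f}, {high:.3f}]")
```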
If you observe model instability, the pooled average is a better estimate of the true performance. The variance between the iterations is important information, and you can compare it to the expected minimal variance for a test set of size $n$ with true performance equal to the average performance over all iterations.
The described difference is IMHO bogus.
You'll observe it only if the distribution of truly positive cases (i.e. cases the reference method labels positive) is very unequal over the folds (as in the example), and the number of relevant test cases (the denominator of the performance measure we're talking about, here the truly positive ones) is not taken into account when averaging the fold averages.
If you weight the first three fold averages with $\frac{4}{12} = \frac{1}{3}$ (as there were 4 test cases among the total 12 that are relevant for calculating the precision), and the last 6 fold averages with 1 (all test cases relevant for the precision calculation), the weighted average is exactly the same as what you'd get from pooling the predictions of the 10 folds and then calculating the precision.
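A numeric sketch of this bookkeeping (the fold counts are invented for illustration, not taken from the question):

```python
# Sketch: weighting fold-wise precision by the number of relevant test cases
# (the denominator of the precision) reproduces the pooled estimate exactly.
import numpy as np

tp = np.array([1, 1, 2, 3, 3, 2, 4, 3, 2, 3])        # true positives per fold
relevant = np.array([4, 4, 4, 4, 3, 3, 5, 4, 3, 4])  # denominator cases per fold

fold_precision = tp / relevant
unweighted = fold_precision.mean()                     # naive average of fold averages
weighted = np.average(fold_precision, weights=relevant)
pooled = tp.sum() / relevant.sum()                     # pool first, then calculate

print(unweighted, weighted, pooled)  # weighted == pooled; unweighted may differ
```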
Edit: the original question also asked about iterating/repeating the validation.
Yes, you should run iterations of the whole $k$-fold cross validation procedure:
From that, you can get an idea of the stability of the predictions of your models:
- How much do the predictions change if the training data is perturbed by exchanging a few training samples?
- I.e., how much do the predictions of different "surrogate" models vary for the same test sample?
You were asking for scientific papers:
- Search terms are "iterated cross validation" and "repeated cross validation".
- Papers that say "you should do this":
  - Dougherty, E. R.; Sima, C.; Hua, J.; Hanczar, B. & Braga-Neto, U. M.: Performance of Error Estimators for Classification. Current Bioinformatics, 2010, 5, 53-67, is a good starting point.
  - For spectroscopic data, I did some simulations: Beleites, C.; Baumgartner, R.; Bowman, C.; Somorjai, R.; Steiner, G.; Salzer, R. & Sowa, M. G.: Variance reduction in estimating classification error using sparse datasets. Chemom. Intell. Lab. Syst., 2005, 79, 91-100 (a preprint is available).
- I use it regularly, e.g. Beleites, C.; Geiger, K.; Kirsch, M.; Sobottka, S. B.; Schackert, G. & Salzer, R.: Raman spectroscopic grading of astrocytoma tissues: using soft reference information. Anal Bioanal Chem, 2011, 400, 2801-2816.
Underestimating variance
Ultimately, your data set has finite (n = 120) sample size, regardless of how many iterations of bootstrap or cross validation you do.
You have (at least) 2 sources of variance in the resampling (cross validation and out-of-bootstrap) validation results:
- variance due to the finite number of (test) samples
- variance due to instability of the predictions of the surrogate models
If your models are stable, then
- iterations of $k$-fold cross validation are not needed (they don't improve the performance estimate: the average over each run of the cross validation is the same).
- However, the performance estimate is still subject to variance due to the finite number of test samples.
- If your data structure is "simple" (i.e. one single measurement vector for each statistically independent case), you can assume that the test results are the results of a Bernoulli process (coin-throwing) and calculate the finite-test-set variance.
Out-of-bootstrap looks at the variance between the individual surrogate models' predictions. That is possible with the cross validation results as well, but it is uncommon. If you do this, you'll see variance due to the finite sample size in addition to the instability. However, keep in mind that some pooling has (usually) already taken place: for cross validation, usually $\frac{n}{k}$ results are pooled per surrogate model, and for out-of-bootstrap a varying number of left-out samples are pooled.
This makes me personally prefer cross validation (for the moment), as it is easier to separate instability from finite-test-sample-size effects.
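A small sketch of that separation (the per-iteration accuracies are invented placeholder values):

```python
# Sketch: compare the between-iteration variance of p-hat (reflecting model
# instability) with the minimal variance p(1-p)/n imposed by the finite test set.
import numpy as np

# hypothetical pooled accuracies from 10 iterations of k-fold CV:
p_hat_per_iter = np.array([0.85, 0.87, 0.83, 0.86, 0.85,
                           0.84, 0.88, 0.85, 0.86, 0.84])
n = 120  # total number of test samples in each CV iteration

p_bar = p_hat_per_iter.mean()
instability_var = p_hat_per_iter.var(ddof=1)  # variance between iterations
finite_test_var = p_bar * (1 - p_bar) / n     # floor set by n test samples

print(f"between-iteration variance:        {instability_var:.2e}")
print(f"finite-test-set variance p(1-p)/n: {finite_test_var:.2e}")
```

Here the instability contribution comes out small compared to the finite-test-set floor, which is the "stable models" situation described above.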
Best Answer
Yes! Calculate one F1 for each run of cross validation and average over the $N$ runs. This is also a great opportunity to see how this approach and calculating F1 for each fold and then averaging over the folds differ from each other (a sketch follows below).

Also yes! That is a good approach. In applications, it is sometimes not about being 100 % correct but about applying methods and techniques according to their ease of use.
However, whenever you are reporting the mean, please also report the variance or the standard deviation.
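A sketch of both approaches side by side (hypothetical data and classifier; scikit-learn assumed), reporting mean and standard deviation over the runs as recommended:

```python
# Sketch: one F1 per CV run (pooled predictions) vs the average of per-fold
# F1 values; both reported as mean +/- standard deviation over N runs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=120, weights=[0.7], random_state=0)
N, k = 20, 5

f1_per_run, f1_foldwise = [], []
for run in range(N):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=run)
    pooled = np.empty_like(y)
    fold_scores = []
    for train, test in cv.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        pooled[test] = model.predict(X[test])
        fold_scores.append(f1_score(y[test], pooled[test]))
    f1_per_run.append(f1_score(y, pooled))    # one F1 from the pooled run
    f1_foldwise.append(np.mean(fold_scores))  # average of the k fold-wise F1s

print(f"per run:   {np.mean(f1_per_run):.3f} +/- {np.std(f1_per_run, ddof=1):.3f}")
print(f"fold-wise: {np.mean(f1_foldwise):.3f} +/- {np.std(f1_foldwise, ddof=1):.3f}")
```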