Very interesting question, I'll have to read the papers you cite... But maybe this will start us in the direction of an answer:
I usually tackle this problem in a very pragmatic way: I iterate the k-fold cross validation with new random splits and calculate performance just as usual for each iteration. The overall test samples are then the same for each iteration, and the differences come from different splits of the data.
This I report e.g. as the 5th to 95th percentile of observed performance with respect to exchanging up to $\frac{n}{k} - 1$ samples for new samples, and discuss it as a measure of model instability.
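As a minimal sketch of how this looks in practice (hypothetical data and classifier; scikit-learn is assumed purely for illustration, any modelling workflow with an equivalent resampling loop works the same way):

```python
# Sketch: iterated k-fold CV with a new random split per iteration; report
# the 5th to 95th percentile of the observed performance over the iterations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=120, random_state=0)  # hypothetical data
k, n_iter = 5, 50

performance = []
for i in range(n_iter):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=i)  # new split each time
    pooled_pred = np.empty_like(y)
    for train, test in cv.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        pooled_pred[test] = model.predict(X[test])  # every sample tested exactly once
    performance.append(accuracy_score(y, pooled_pred))  # one value per iteration

print("5th to 95th percentile:", np.percentile(performance, [5, 95]))
```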
Side note: in any case, I cannot use formulas that need the sample size: as my data are clustered/hierarchical in structure (many similar but not repeated measurements of the same case, usually several [hundred] different locations of the same specimen), I don't know the effective sample size.
Comparison to bootstrapping:

- Iterations use new random splits.
- The main difference is resampling with (bootstrap) or without (cv) replacement.
- Computational cost is about the same, as I'd choose the number of cv iterations $\approx$ number of bootstrap iterations $/\, k$, i.e. calculate the same total number of models (a sketch follows below).
- Bootstrap has advantages over cv in terms of some statistical properties (it is asymptotically correct; possibly you need fewer iterations to obtain a good estimate).
- However, with cv you have the advantage that you are guaranteed that
  - the number of distinct training samples is the same for all models (important if you want to calculate learning curves), and
  - each sample is tested exactly once in each iteration.
- Some classification methods discard repeated samples, so bootstrapping does not make sense for them.
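A minimal sketch of the out-of-bootstrap counterpart under the same total model budget ($n_{bootstrap} = k \cdot n_{iter.~cv}$; hypothetical data and classifier again, scikit-learn assumed):

```python
# Sketch: out-of-bootstrap validation with the same total number of models
# as the repeated CV above (n_boot = k * n_iter).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=120, random_state=0)
rng = np.random.default_rng(0)
n = len(y)
n_boot = 5 * 50  # = k * n_iter of the CV sketch above

oob_acc = []
for _ in range(n_boot):
    train = rng.integers(0, n, size=n)       # resampling WITH replacement
    oob = np.setdiff1d(np.arange(n), train)  # left-out samples form the test set
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    oob_acc.append(accuracy_score(y[oob], model.predict(X[oob])))

# unlike cv, both the number of distinct training samples and the number of
# tested samples vary from model to model
print("mean out-of-bootstrap accuracy:", np.mean(oob_acc))
```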
Variance of the performance
Short answer: yes, it does make sense to speak of variance in situations where only $\{0, 1\}$ outcomes exist.
Have a look at the binomial distribution ($k$ = number of successes, $n$ = number of tests, $p$ = true probability of success = expected value of $\frac{k}{n}$):
$\sigma^2 (k) = np(1-p)$
The variance of proportions (such as hit rate, error rate, sensitivity, TPR, ...; I'll use $p$ from now on and $\hat p$ for the observed value in a test) is a topic that fills whole books...
- Fleiss: Statistical Methods for Rates and Proportions
- Forthofer and Lee: Biostatistics has a nice introduction.
Now, $\hat p = \frac{k}{n}$ and therefore:
$\sigma^2 (\hat p) = \frac{p (1-p)}{n}$
This means that the uncertainty in measuring classifier performance depends only on the true performance $p$ of the tested model and the number of test samples $n$.
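A quick worked example (numbers purely illustrative): with true performance $p = 0.9$ and $n = 100$ test samples, $\sigma (\hat p) = \sqrt{\frac{0.9 \cdot 0.1}{100}} = 0.03$, so observed performances roughly between 84 % and 96 % ($\hat p \pm 2 \sigma$) are entirely compatible with one and the same true performance.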
In cross validation you assume

1. that the $k$ "surrogate" models have the same true performance as the "real" model you usually build from all samples (the breakdown of this assumption is the well-known pessimistic bias), and
2. that the $k$ "surrogate" models have the same true performance as each other (are equivalent, have stable predictions), so you are allowed to pool the results of the $k$ tests.
Of course then not only the $k$ "surrogate" models of one iteration of cv can be pooled, but also the $k \cdot i$ models of $i$ iterations of $k$-fold cv.
Why iterate?
The main thing the iterations tell you is the model (prediction) instability, i.e. variance of the predictions of different models for the same sample.
You can report instability directly, e.g. as the variance of the prediction of a given test case (regardless of whether the prediction is correct), or a bit more indirectly as the variance of $\hat p$ between different cv iterations.
And yes, this is important information.
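A sketch of how to measure this per-sample instability (same hypothetical setup as above, scikit-learn assumed):

```python
# Sketch: record the prediction of every surrogate model for every sample,
# then look at the per-sample variance across CV iterations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=120, random_state=0)
k, n_iter = 5, 50
pred = np.empty((n_iter, len(y)), dtype=int)  # one prediction per sample and iteration

for i in range(n_iter):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=i)
    for train, test in cv.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        pred[i, test] = model.predict(X[test])

per_sample_var = pred.var(axis=0)          # instability of individual predictions
p_hat_per_iter = (pred == y).mean(axis=1)  # accuracy estimate of each iteration
print("samples with unstable predictions:", int((per_sample_var > 0).sum()))
print("variance of p-hat between iterations:", p_hat_per_iter.var(ddof=1))
```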
Now, if your models are perfectly stable, all $n_{bootstrap}$ or $k \cdot n_{iter.~cv}$ surrogate models would produce exactly the same prediction for a given sample. In other words, all iterations would have the same outcome, and the variance of the estimate would not be reduced by iterating (assuming $n - 1 \approx n$). In that case, assumption 2 from above is met and you are subject only to $\sigma^2 (\hat p) = \frac{p (1-p)}{n}$, with $n$ being the total number of samples tested in all $k$ folds of the cv.
In that case, iterations are not needed (other than for demonstrating stability).
You can then construct confidence intervals for the true performance $p$ from the observed number of successes $k$ in the $n$ tests. So, strictly speaking, there is no need to report the uncertainty if $\hat p$ and $n$ are reported. However, in my field not many people are aware of that, or even have an intuitive grip on how large the uncertainty is for a given sample size. So I'd recommend reporting it anyway.
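A minimal sketch of such a confidence interval (assuming statsmodels; the counts are made up):

```python
# Sketch: binomial (Wilson) confidence interval for the true performance p
# from k pooled successes in n tests.
from statsmodels.stats.proportion import proportion_confint

k_success, n_tests = 102, 120  # e.g. 102 correct out of 120 pooled test results
low, high = proportion_confint(k_success, n_tests, alpha=0.05, method="wilson")
print(f"p-hat = {k_success / n_tests:.3f}, 95 % CI [{low:.3f}, {high:.3f}]")
```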
If you observe model instability, the pooled average is a better estimate of the true performance. The variance between the iterations is important information, and you can compare it to the expected minimal variance for a test set of size $n$ with true performance equal to the average performance over all iterations.
The described difference is IMHO bogus.
You'll observe it only if the distribution of truly positive cases (i.e. cases the reference method labels positive) is very unequal over the folds (as in the example), and the number of relevant test cases (the denominator of the performance measure we're talking about, here the truly positive ones) is not taken into account when averaging the fold averages.
If you weight the first three fold averages with $\frac{4}{12} = \frac{1}{3}$ (as there were 4 test cases among the total 12 that are relevant for calculating the precision), and the last 6 fold averages with 1 (all test cases relevant for the precision calculation), the weighted average is exactly the same as what you'd get from pooling the predictions of the 10 folds and then calculating the precision.
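A numeric sketch of this bookkeeping (the fold counts are invented for illustration, not taken from the question):

```python
# Sketch: weighting fold-wise precision by the number of relevant test cases
# (the denominator of the precision) reproduces the pooled estimate exactly.
import numpy as np

tp = np.array([1, 1, 2, 3, 3, 2, 4, 3, 2, 3])        # true positives per fold
relevant = np.array([4, 4, 4, 4, 3, 3, 5, 4, 3, 4])  # denominator cases per fold

fold_precision = tp / relevant
unweighted = fold_precision.mean()                     # naive average of fold averages
weighted = np.average(fold_precision, weights=relevant)
pooled = tp.sum() / relevant.sum()                     # pool first, then calculate

print(unweighted, weighted, pooled)  # weighted == pooled; unweighted may differ
```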
Edit: the original question also asked about iterating/repeating the validation.
Yes, you should run iterations of the whole $k$-fold cross validation procedure:
From that, you can get an idea of the stability of the predictions of your models:
- How much do the predictions change if the training data is perturbed by exchanging a few training samples?
- I.e., how much do the predictions of different "surrogate" models vary for the same test sample?
You were asking for scientific papers:
- Search terms are "iterated cross validation" and "repeated cross validation".
- Papers that say "you should do this":
  - Dougherty, E. R.; Sima, C.; Hua, J.; Hanczar, B. & Braga-Neto, U. M.: Performance of Error Estimators for Classification. Current Bioinformatics, 2010, 5, 53-67, is a good starting point.
  - For spectroscopic data, I did some simulations: Beleites, C.; Baumgartner, R.; Bowman, C.; Somorjai, R.; Steiner, G.; Salzer, R. & Sowa, M. G.: Variance reduction in estimating classification error using sparse datasets. Chemom. Intell. Lab. Syst., 2005, 79, 91-100 (a preprint is available).
- I use it regularly, e.g. Beleites, C.; Geiger, K.; Kirsch, M.; Sobottka, S. B.; Schackert, G. & Salzer, R.: Raman spectroscopic grading of astrocytoma tissues: using soft reference information. Anal Bioanal Chem, 2011, 400, 2801-2816.
Underestimating variance
Ultimately, your data set has finite (n = 120) sample size, regardless of how many iterations of bootstrap or cross validation you do.
You have (at least) 2 sources of variance in the resampling (cross validation and out-of-bootstrap) validation results:
- variance due to the finite number of (test) samples
- variance due to instability of the predictions of the surrogate models
If your models are stable, then
- iterations of $k$-fold cross validation are not needed (they don't improve the performance estimate: the average over each run of the cross validation is the same).
- However, the performance estimate is still subject to variance due to the finite number of test samples.
- If your data structure is "simple" (i.e. one single measurement vector for each statistically independent case), you can assume that the test results are the results of a Bernoulli process (coin-throwing) and calculate the finite-test-set variance.
Out-of-bootstrap looks at the variance between the individual surrogate models' predictions. That is possible with the cross validation results as well, but it is uncommon. If you do this, you'll see variance due to the finite sample size in addition to the instability. However, keep in mind that some pooling has (usually) already taken place: for cross validation, usually $\frac{n}{k}$ results are pooled per surrogate model, and for out-of-bootstrap a varying number of left-out samples are pooled.
This makes me personally prefer cross validation (for the moment), as it is easier to separate instability from finite-test-sample-size effects.
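A small sketch of that separation (the per-iteration accuracies are invented placeholder values):

```python
# Sketch: compare the between-iteration variance of p-hat (reflecting model
# instability) with the minimal variance p(1-p)/n imposed by the finite test set.
import numpy as np

# hypothetical pooled accuracies from 10 iterations of k-fold CV:
p_hat_per_iter = np.array([0.85, 0.87, 0.83, 0.86, 0.85,
                           0.84, 0.88, 0.85, 0.86, 0.84])
n = 120  # total number of test samples in each CV iteration

p_bar = p_hat_per_iter.mean()
instability_var = p_hat_per_iter.var(ddof=1)  # variance between iterations
finite_test_var = p_bar * (1 - p_bar) / n     # floor set by n test samples

print(f"between-iteration variance:        {instability_var:.2e}")
print(f"finite-test-set variance p(1-p)/n: {finite_test_var:.2e}")
```

Here the instability contribution comes out small compared to the finite-test-set floor, which is the "stable models" situation described above.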
Best Answer
Yes! Calculate one F1 for each run of cross validation and average over the $N$ runs. This is also a great opportunity to see how this approach and calculating F1 for each fold and then averaging over the folds differ from each other (a sketch follows below).

Also yes! That is a good approach. In applications, it is sometimes not about being 100 % correct but about applying methods and techniques according to their ease of use.
However, whenever you are reporting the mean, please also report the variance or the standard deviation.
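A sketch of both approaches side by side (hypothetical data and classifier; scikit-learn assumed), reporting mean and standard deviation over the runs as recommended:

```python
# Sketch: one F1 per CV run (pooled predictions) vs the average of per-fold
# F1 values; both reported as mean +/- standard deviation over N runs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=120, weights=[0.7], random_state=0)
N, k = 20, 5

f1_per_run, f1_foldwise = [], []
for run in range(N):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=run)
    pooled = np.empty_like(y)
    fold_scores = []
    for train, test in cv.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        pooled[test] = model.predict(X[test])
        fold_scores.append(f1_score(y[test], pooled[test]))
    f1_per_run.append(f1_score(y, pooled))    # one F1 from the pooled run
    f1_foldwise.append(np.mean(fold_scores))  # average of the k fold-wise F1s

print(f"per run:   {np.mean(f1_per_run):.3f} +/- {np.std(f1_per_run, ddof=1):.3f}")
print(f"fold-wise: {np.mean(f1_foldwise):.3f} +/- {np.std(f1_foldwise, ddof=1):.3f}")
```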