Very interesting question; I'll have to read the papers you cite... But maybe this will start us in the direction of an answer:
I usually tackle this problem in a very pragmatic way: I iterate the k-fold cross validation with new random splits and calculate performance as usual for each iteration. The overall set of test samples is then the same for each iteration; the differences come from the different splits of the data.
I report this e.g. as the 5th to 95th percentile of observed performance with respect to exchanging up to $\frac{n}{k} - 1$ samples for new samples, and discuss it as a measure of model instability.
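As an illustration, here is a minimal sketch of this iterated (repeated) k-fold scheme with scikit-learn; the data set, the classifier, and the 20 repetitions are placeholder assumptions, not part of the original answer:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in data set
n_repeats, k = 20, 5

# k-fold CV repeated n_repeats times, each repetition with a new random split
cv = RepeatedStratifiedKFold(n_splits=k, n_repeats=n_repeats, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=cv)          # one accuracy per fold

# pool the k folds of each repetition into one performance value per repetition
per_repetition = scores.reshape(n_repeats, k).mean(axis=1)

# report e.g. the 5th to 95th percentile over the repetitions
print(np.percentile(per_repetition, [5, 95]))
```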
Side note: I cannot use formulas that need the sample size anyway. As my data are clustered or hierarchical in structure (many similar but not repeated measurements of the same case, usually several [hundred] different locations of the same specimen), I don't know the effective sample size.
Comparison to bootstrapping:

- Both iterate with new random splits/draws of the data.
- The main difference is resampling with (bootstrap) or without (cv) replacement (illustrated in the sketch after this list).
- Computational cost is about the same, as I'd choose the number of cv iterations $\approx$ the number of bootstrap iterations $/\, k$, i.e. calculate the same total number of models.
- Bootstrap has advantages over cv in terms of some statistical properties (asymptotically correct; possibly you need fewer iterations to obtain a good estimate).
- However, with cv you have the advantage that you are guaranteed that
  - the number of distinct training samples is the same for all models (important if you want to calculate learning curves), and
  - each sample is tested exactly once in each iteration.
- Some classification methods discard repeated samples, so bootstrapping does not make sense for them.
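To make the with/without-replacement difference concrete, here is a small numpy sketch (the sample size of 100 and k = 5 are arbitrary assumptions) counting the distinct training samples each scheme provides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5
idx = np.arange(n)

# bootstrap: draw n training samples *with* replacement
boot_train = rng.choice(idx, size=n, replace=True)
print("bootstrap distinct training samples:", np.unique(boot_train).size)  # varies, ~63 on average

# k-fold CV: each training set is the data *without* one fold, i.e. sampled without replacement
folds = np.array_split(rng.permutation(idx), k)
cv_train = np.setdiff1d(idx, folds[0])
print("cv distinct training samples:", cv_train.size)  # always n - n/k = 80
```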
Variance for the performance
Short answer: yes, it does make sense to speak of variance in situations where only {0, 1} outcomes exist.
Have a look at the binomial distribution (with $k$ = number of successes, $n$ = number of tests, and $p$ = true probability of success, i.e. the expected value of $k / n$):
$\sigma^2 (k) = np(1-p)$
The variance of proportions (such as hit rate, error rate, sensitivity, TPR,..., I'll use $p$ from now on and $\hat p$ for the observed value in a test) is a topic that fills whole books...
- Fleiss: Statistical Methods for Rates and Proportions
- Forthofer and Lee: Biostatistics has a nice introduction.
Now, $\hat p = \frac{k}{n}$ and therefore:
$\sigma^2 (\hat p) = \frac{p (1-p)}{n}$
This means that the uncertainty in measuring classifier performance depends only on the true performance $p$ of the tested model and the number $n$ of test samples.
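As a quick sanity check of $\sigma^2 (\hat p) = \frac{p (1-p)}{n}$, the following sketch (the true performance and test-set size are made-up numbers) compares the simulated variance of $\hat p$ with the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.85, 100          # assumed true performance and number of test samples

# simulate many test runs of n independent {0,1} outcomes and record p_hat = k/n
k = rng.binomial(n=n, p=p, size=100_000)
p_hat = k / n

print("simulated variance of p_hat:", p_hat.var())
print("binomial formula p(1-p)/n:  ", p * (1 - p) / n)
```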
In cross validation you assume

1. that the k "surrogate" models have the same true performance as the "real" model you usually build from all samples (the breakdown of this assumption is the well-known pessimistic bias), and
2. that the k "surrogate" models have the same true performance as each other (are equivalent, have stable predictions), so you are allowed to pool the results of the k tests.

Of course then not only the k "surrogate" models of one iteration of cv can be pooled, but also the $k \cdot i$ models of $i$ iterations of k-fold cv.
Why iterate?
The main thing the iterations tell you is the model (prediction) instability, i.e. variance of the predictions of different models for the same sample.
You can report instability directly, e.g. as the variance of the prediction for a given test case (regardless of whether the prediction is correct), or a bit more indirectly as the variance of $\hat p$ across different cv iterations.
And yes, this is important information.
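A minimal sketch of how such instability could be quantified (classifier, data set, and number of repetitions are again placeholder assumptions): collect the prediction each repetition's surrogate model makes for each sample and look at the per-case variance across repetitions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
n_repeats, k = 20, 5
cv = RepeatedStratifiedKFold(n_splits=k, n_repeats=n_repeats, random_state=0)

# predictions[r, i] = prediction for sample i by the surrogate model that tested it in repetition r
predictions = np.empty((n_repeats, len(y)))
for split, (train, test) in enumerate(cv.split(X, y)):
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X[train], y[train])
    predictions[split // k, test] = model.predict(X[test])

# instability: variance of the prediction for a given case across repetitions
per_case_variance = predictions.var(axis=0)
print("fraction of cases with unstable predictions:", (per_case_variance > 0).mean())
```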
Now, if your models are perfectly stable, all $n_{bootstrap}$ or $k \cdot n_{iter.~cv}$ surrogate models would produce exactly the same prediction for a given sample. In other words, all iterations would have the same outcome, and the variance of the estimate would not be reduced by iterating (assuming $n - 1 \approx n$). In that case, assumption 2 from above is met and you are subject only to $\sigma^2 (\hat p) = \frac{p (1-p)}{n}$, with $n$ being the total number of samples tested in all k folds of the cv.
In that case, iterations are not needed (other than for demonstrating stability).
You can then construct confidence intervals for the true performance $p$ from the observed number of successes $k$ in the $n$ tests. So, strictly speaking, there is no need to report the variance/uncertainty if $\hat p$ and $n$ are reported. However, in my field, not many people are aware of that or even have an intuitive grip on how large the uncertainty is for a given sample size. So I'd recommend reporting it anyway.
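For example, a binomial confidence interval for the true performance can be obtained from $k$ and $n$ alone; here is a sketch using statsmodels (the counts are made up):

```python
from statsmodels.stats.proportion import proportion_confint

k, n = 85, 100   # assumed: 85 correct predictions out of 100 pooled test cases

# 95% Wilson score interval for the true performance p
low, high = proportion_confint(count=k, nobs=n, alpha=0.05, method="wilson")
print(f"p_hat = {k / n:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```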
If you do observe model instability, the pooled average is a better estimate of the true performance. The variance between the iterations is important information, and you could compare it to the expected minimal variance $\frac{p (1-p)}{n}$ for a test set of size $n$ whose true performance equals the average performance over all iterations.
If all K repeats of training and evaluating the model give nearly the same performance, this indicates that we have overcome the overfitting issue.
That is not true, see below.
How much variance is a good indicator that we don't have overfitting any more? (meaning: how much, given as a number such as < 0.2, for example)
Variance may be one of the symptoms of overfitting, but
- there are more direct indicators of overfitting
- and you may observe testing variance due to having only few test cases also for non-overfit models.
- However, model instability (variance between the models and between the predictions for the same case) almost always comes with overfitting. Depending on the modeling algorithm, instability may be measured by comparing the fitted model parameters across the k surrogate models (though there can be "spurious" variance between equivalent models that have different parameter sets). Instability can also be measured more directly by comparing the predictions for the same test case across models from different runs/iterations/repetitions of k-fold cross validation, as sketched below.
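A sketch of the parameter-comparison idea (data set and logistic-regression model are illustrative assumptions): fit one surrogate model per fold and look at the spread of the fitted coefficients:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# fit the k surrogate models and collect their fitted coefficients
coefs = []
for train, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X[train], y[train])
    coefs.append(model[-1].coef_.ravel())

# a large spread across surrogate models suggests unstable (possibly overfit) models
coefs = np.array(coefs)
print("coefficient std across surrogate models:", coefs.std(axis=0).round(2))
```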
Which one of mean squared error, root mean squared error, median absolute deviation, or performance variance measurement is best?
The figure of merit doesn't matter at all here; it just needs to be a sensible metric of performance for your application.
Note that variance can be calculated for each of the errors / loss functions.
How to measure overfitting
I think the most straightforward way of measuring overfitting is to compare the model's internal estimate of error (e.g. training error, inner cross validation error with data-driven optimization/selection) with the external independent error estimate. A large discrepancy between those two estimates indicates overfitting.
"Large" should be judged relative to the variance that results from testing only the finite number of cases you have at hand.
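One way to obtain such a pair of internal vs. external estimates is nested cross validation; the sketch below (model, grid, and data set are illustrative assumptions, not the answer's prescription) compares the optimistic inner-CV score of the selected model with the outer-CV score of the whole selection procedure:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# inner CV: data-driven hyperparameter selection
inner = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10, 100]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
inner.fit(X, y)
internal_estimate = inner.best_score_          # internal (optimistically biased) estimate

# outer CV: independent estimate of the complete selection procedure
external_estimate = cross_val_score(
    inner, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=1)
).mean()

# a large gap between the two estimates indicates overfitting of the selection
print("internal:", round(internal_estimate, 3), "external:", round(external_estimate, 3))
```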
Best Answer
Dealing with regression can be confusing because there are two different SDs involved. The whole point of the cross validation is to give you an estimate of the future behavior of the regressor. In this case you have 5 estimates of the regressor's behavior on future data, one for each fold.
What do you want to know about the regressor on future data:
1) what is the expected error (MSE) on future data - that is the mean of the 5 CV MSEs
2) what is the expected SD of the errors on future data - that is the mean of the 5 CV SDs!
You may also want to evaluate how certain you are of those estimates:
3) what is the variance (or SD) of the estimate of future MSE - that is the variance (SD) of the 5 CV MSEs -- this is the one that is low in your example - so you know that your estimate of the future MSE is pretty tight. (Well maybe - there is a paper that shows that the SD of a CV measure - in this case the MSE - is not a good estimate of the true SD - but let us leave this for later)
4) what is the variance (or SD) of the estimate of the future SD of the errors - that is the variance (SD) of the 5 CV SDs
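A sketch of how these four quantities could be computed (the regressor and data set are placeholders, and the per-fold "SD" is taken here as the spread of the squared errors within the fold, which is one possible reading of the answer):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)

fold_mse, fold_sd = [], []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    squared_errors = (Ridge().fit(X[train], y[train]).predict(X[test]) - y[test]) ** 2
    fold_mse.append(squared_errors.mean())   # MSE of this fold
    fold_sd.append(squared_errors.std())     # SD of the squared errors within this fold

# 1) expected MSE on future data            2) expected SD of the errors on future data
print("mean of fold MSEs:", np.mean(fold_mse), "| mean of fold SDs:", np.mean(fold_sd))
# 3) uncertainty of the MSE estimate        4) uncertainty of the SD estimate
print("SD of fold MSEs:  ", np.std(fold_mse), "| SD of fold SDs:  ", np.std(fold_sd))
```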
So going back to your question
The apparent standard error of the MSE is close to zero (or zero in the extreme case where each fold yields the same value). (That is the SD of the estimate of the MSE, not the estimate of the SD for future data.) Yet we know the SD of the MSE within each fold, and it is not zero at all. (That is the estimate of the future SD.)