The larger my test set is, the smaller the training set gets, so I discard potential information. Can this be solved via a "stacked" n-fold cv?
Yes. It is usually called nested or double cross validation, and we have a number of questions and answers about that. You could start e.g. with Nested cross validation for model selection
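For the practical side, here is a minimal sketch of nested cross-validation assuming scikit-learn; the SVC estimator, the parameter grid and the synthetic data are placeholder choices for illustration, not anything taken from the question:

```python
# Minimal sketch of nested (double) cross-validation with scikit-learn.
# The estimator, grid, and synthetic data are placeholder assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # used for hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # used for performance estimation

# Inner loop: search over C; outer loop: score the whole tuning procedure.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```

The inner loop picks the hyperparameters, while the outer loop tests on data that never influenced that choice, so the outer score is not optimistically biased by the tuning.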
Do I really have to make a REPEATED n-fold cv? Are there other possibilities?
Repetitions / iterations in resampling validation help only if the (surrogate) models are unstable. If you are really sure your models are stable (but how can you be, when you are worried about small sample size?) then you don't need the iterations / repetitions. OTOH, IMHO the easiest way to show that the models are stable is to run a few iterations and look at the stability of the predictions.
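A rough sketch of that stability check, assuming scikit-learn's RepeatedKFold; the logistic-regression model and the toy data are placeholder assumptions:

```python
# Sketch: repeat k-fold CV with different random splits and look at the spread
# of the per-repetition scores as a crude stability check.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=80, n_features=5, random_state=0)
cv = RepeatedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Splits are generated repetition by repetition, so grouping the 100 fold
# scores into 20 rows of 5 gives one mean per repetition.
per_rep = scores.reshape(20, 5).mean(axis=1)
print("mean accuracy per repetition:", np.round(per_rep, 3))
print("spread across repetitions   :", per_rep.max() - per_rep.min())
```

A small spread of the per-repetition means is consistent with stable surrogate models; a large spread is the instability the iterations are meant to reveal.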
Is the error rate an appropriate loss function or should I choose another one (e.g. the empirical error function or MSE, but then I'd need a probability output, right?)?
No, overall error rate is not a very good loss function, particularly not for optimization. MSE is much better; it is a proper scoring rule. Yes, proper scoring rules need probability output.
However, SVMs are in any case quite ugly to optimize, because they do not react continuously to small continuous changes in the training data + hyperparameters: up to a certain limit nothing changes (i.e. the same cases stay support vectors), then suddenly the support vectors change.
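To make the scoring-rule point concrete, a small sketch comparing the error rate with the Brier score, which is simply MSE on predicted probabilities; the SVC with Platt-scaled probability output and the toy data are assumptions for illustration:

```python
# Sketch: proper scoring rule (Brier score) vs. plain error rate.
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, zero_one_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# probability=True gives probability output via Platt scaling
model = SVC(probability=True, random_state=0).fit(X_tr, y_tr)
p_hat = model.predict_proba(X_te)[:, 1]

print("error rate :", zero_one_loss(y_te, model.predict(X_te)))
print("Brier score:", brier_score_loss(y_te, p_hat))  # proper scoring rule
```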
You are very close to understanding k-fold cross-validation. To answer your questions in turn.
1. So to use k-fold cross validation the required data is the labeled data?
Yes, you must have some 'known' result for your model to be trained on. You are building a model, I assume, to predict some sort of outcome, either regression or classification. In order to do so, the model must be built on data that explains some known result.
2. How about non labeled data?
For k-fold cross-validation, you will have split your data into k groups (e.g. 10). You then select one of those groups and use the model (built from your training data) to predict the 'labels' of this testing group. Once you have your model built and cross-validated, it can be used to predict data that don't currently have labels. The cross-validation is a means to guard against overfitting.
As a last clarification, you aren't only using 1 of the 10 groups. Let's say you had 100 samples. You split them into groups 1-10, 11-20, ..., 91-100. You would first train on groups 11-100 and predict on the test group 1-10. Then you would repeat the same analysis with groups 1-10 and 21-100 as the training data and 11-20 as the testing group, and so forth. The results are typically averaged at the end.
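If it helps, this is that splitting scheme in code, assuming scikit-learn's KFold and 100 placeholder samples:

```python
# Sketch of the splitting scheme: 100 samples, 10 folds, each fold used once as the test group.
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(100)  # stand-in for 100 labeled samples
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=10).split(indices), start=1):
    # Without shuffling, the first fold tests the first 10 samples, the second fold the next 10, and so on.
    print(f"fold {fold}: train on {len(train_idx)} samples, test on samples {test_idx[0]}-{test_idx[-1]}")
```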
As a simple example say I have the following abbreviated data (binary classification):
Label Variable
A 0.354
A 0.487
A 0.384
A 0.395
A 0.436
B 0.365
B 0.318
B 0.327
B 0.381
B 0.355
Let's say I want to do 5-fold cross-validation on this (with only two samples per fold, this is nearly leave-one-out cross-validation)
My first testing group will be:
A 0.354
A 0.487
My training set is the remaining data. See how the labels are present in both groups?
A 0.384
A 0.395
A 0.436
B 0.365
B 0.318
B 0.327
B 0.381
B 0.355
Please note that it is also best practice to randomize the grouping; this ordering is purely for demonstration.
Then you fit your model to the training set, using the variable(s) to best explain the labels (class A or B). The model that has been fit to this training set is then used to predict the testing set: you remove the labels from the testing set, predict them with the trained model, and compare the predicted labels to the actual labels. This is repeated for all 5 folds and the results are averaged.
Once everything is completed and you have your wonderfully cross-validated model, you can use it to predict unlabeled data and have some sort of measure of confidence in your results.
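Here is roughly what that whole loop looks like on the toy data above, assuming scikit-learn; logistic regression is only a placeholder classifier, and any classifier taking one variable would do:

```python
# Sketch of the full CV loop on the toy data: hold out each group, fit on the
# rest, predict the held-out labels, and average the per-fold accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X = np.array([0.354, 0.487, 0.384, 0.395, 0.436,
              0.365, 0.318, 0.327, 0.381, 0.355]).reshape(-1, 1)
y = np.array(["A"] * 5 + ["B"] * 5)

accuracies = []
for train_idx, test_idx in KFold(n_splits=5).split(X):  # groups taken in order, as in the example
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    accuracies.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))

print("per-fold accuracy:", np.round(accuracies, 2))
print("cross-validated accuracy:", np.mean(accuracies))
```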
Extended for Parameter Tuning
Let's say you are tuning a partial least squares (PLS) model (it doesn't matter if you don't know what this is for demonstration purposes). I would like to determine how many components (the tuning parameter) I should have in the model. I would like to test 2, 3, 4, and 5 components and see how many I should use to maximize my predictive accuracy without overfitting the model. I would conduct the entire cross-validation series for each component number. Each iteration of the CV would be averaged (giving the average predictive accuracy of the entire analysis).
Assuming classification accuracy is your metric, let's say these are my results (completely made up here):
2 components: 70%
3 components: 82%
4 components: 78%
5 components: 74%
Clearly, I would then choose 3 components for my model, which has now been cross-validated to avoid overfitting and maximize predictive accuracy. I can then use this optimized model to predict a new dataset where I don't know the labels.
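A sketch of that tuning loop, using scikit-learn's PLSRegression as a crude PLS-DA stand-in (labels coded 0/1 and predictions thresholded at 0.5); the synthetic data and whatever accuracies it prints are assumptions, not the made-up numbers quoted above:

```python
# Sketch: run the full 10-fold CV once per candidate component count and
# compare the averaged accuracies.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

for n_comp in (2, 3, 4, 5):
    accs = []
    # same splits for every component count, so the comparison is fair
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        pls = PLSRegression(n_components=n_comp).fit(X[train_idx], y[train_idx])
        pred = (pls.predict(X[test_idx]).ravel() > 0.5).astype(int)
        accs.append(np.mean(pred == y[test_idx]))
    print(f"{n_comp} components: mean CV accuracy = {np.mean(accs):.2f}")
```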
Best Answer
Very interesting question, I'll have to read the papers you give... But maybe this will start us in the direction of an answer:
I usually tackle this problem in a very pragmatic way: I iterate the k-fold cross validation with new random splits and calculate performance just as usual for each iteration. The overall test samples are then the same for each iteration, and the differences come from different splits of the data.
I report this, e.g., as the 5th to 95th percentile of observed performance with respect to exchanging up to $\frac{n}{k} - 1$ samples for new samples, and discuss it as a measure of model instability.
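In code, that pragmatic procedure might look roughly like this; the model and data are placeholders, and scikit-learn is assumed:

```python
# Sketch: iterate k-fold CV with new random splits and report the 5th to 95th
# percentile of the per-iteration performance as an instability measure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=8, random_state=0)

iteration_means = []
for seed in range(50):                                  # 50 iterations, each with a new random split
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    iteration_means.append(scores.mean())               # same n test samples in every iteration

lo, hi = np.percentile(iteration_means, [5, 95])
print(f"5th-95th percentile of iterated CV accuracy: [{lo:.3f}, {hi:.3f}]")
```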
Side note: I cannot use formulas that need the sample size anyway. As my data are clustered or hierarchical in structure (many similar but not repeated measurements of the same case, usually several [hundred] different locations of the same specimen), I don't know the effective sample size.
Comparison to bootstrapping:
iterations use new random splits.
the main difference is resampling with (bootstrap) or without (cv) replacement.
computational cost is about the same, as I'd choose no of iterations of cv $\approx$ no of bootstrap iterations / k, i.e. calculate the same total no of models.
bootstrap has advantages over cv in terms of some statistical properties (asymptotically correct, possibly you need fewer iterations to obtain a good estimate)
however, with cv you have the advantage that you are guaranteed that every sample is tested exactly once in each run
some classification methods will discard repeated samples, so bootstrapping does not make sense
Variance for the performance
Short answer: yes, it does make sense to speak of variance in a situation where only $\{0, 1\}$ outcomes exist.
Have a look at the binomial distribution ($k$ = successes, $n$ = tests, $p$ = true probability of success, i.e. the expected value of $k / n$):
$\sigma^2 (k) = np(1-p)$
The variance of proportions (such as hit rate, error rate, sensitivity, TPR,..., I'll use $p$ from now on and $\hat p$ for the observed value in a test) is a topic that fills whole books...
Now, $\hat p = \frac{k}{n}$ and therefore:
$\sigma^2 (\hat p) = \frac{p (1-p)}{n}$
This means that the uncertainty in measuring classifier performance depends only on the true performance $p$ of the tested model and the number of test samples $n$.
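For example, with a true performance of $p = 0.8$ and $n = 100$ test samples, $\sigma (\hat p) = \sqrt{0.8 \cdot 0.2 / 100} \approx 0.04$: the observed accuracy scatters by roughly $\pm 4$ percentage points even if the model and the testing procedure are otherwise perfect.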
In cross validation you assume
that the k "surrogate" models have the same true performance as the "real" model you usually build from all samples. (The breakdown of this assumption is the well-known pessimistic bias).
that the k "surrogate" models have the same true performance (are equivalent, have stable predictions), so you are allowed to pool the results of the k tests.
Of course, then not only the $k$ "surrogate" models of one iteration of cv can be pooled, but also the $k \cdot i$ models of $i$ iterations of k-fold cv.
Why iterate?
The main thing the iterations tell you is the model (prediction) instability, i.e. variance of the predictions of different models for the same sample.
You can report instability directly, e.g. as the variance in the predictions of a given test case (regardless of whether the prediction is correct), or a bit more indirectly as the variance of $\hat p$ across different cv iterations.
And yes, this is important information.
Now, if your models are perfectly stable, all $n_{bootstrap}$ or $k \cdot n_{iter.~cv}$ surrogate models would produce exactly the same prediction for a given sample. In other words, all iterations would have the same outcome. The variance of the estimate would not be reduced by the iterations (assuming $n - 1 \approx n$). In that case, assumption 2 from above is met and you are subject only to $\sigma^2 (\hat p) = \frac{p (1-p)}{n}$, with $n$ being the total number of samples tested across all $k$ folds of the cv.
In that case, iterations are not needed (other than for demonstrating stability).
You can then construct confidence intervals for the true performance $p$ from the observed number of successes $k$ in the $n$ tests. So, strictly speaking, there is no need to report the variance / uncertainty if $\hat p$ and $n$ are reported. However, in my field, not many people are aware of that or even have an intuitive grip on how large the uncertainty is for a given sample size. So I'd recommend reporting it anyway.
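For instance, a Clopper-Pearson (exact binomial) interval can be computed directly from the beta distribution; the counts below ($k = 82$ successes out of $n = 100$ pooled tests) are made up for illustration:

```python
# Sketch: exact binomial (Clopper-Pearson) confidence interval for the true
# performance p from pooled CV results.
from scipy.stats import beta

k, n, alpha = 82, 100, 0.05
lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
print(f"observed p_hat = {k / n:.2f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```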
If you observe model instability, the pooled average is a better estimate of the true performance. The variance between the iterations is important information, and you could compare it to the expected minimal variance for a test set of size $n$ with a true performance equal to the average performance over all iterations.
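A sketch of that comparison; the per-iteration accuracies and the test-set size below are made up, and in practice they would come from an iterated CV run like the one sketched further up:

```python
# Sketch: compare the observed variance of p_hat across CV iterations with the
# minimal binomial variance p(1-p)/n expected for a test set of size n.
import numpy as np

iteration_means = np.array([0.79, 0.83, 0.81, 0.77, 0.84, 0.80])  # made-up per-iteration accuracies
n = 100                                                            # total test samples per iteration

p_bar = iteration_means.mean()
observed_var = iteration_means.var(ddof=1)
binomial_var = p_bar * (1 - p_bar) / n

print(f"observed variance between iterations: {observed_var:.5f}")
print(f"minimal binomial variance p(1-p)/n : {binomial_var:.5f}")
# A between-iteration variance well above the binomial floor points to model instability.
```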