Let me chime in from a different point of view:
"Cross validation" and "validation set" are concepts that are orthogonal/independent in the sense that:
1. Validation set is about asking: how many/which separate data subsets do I need?
2. Whereas cross validation is one possible answer to the question: how do I generate/split my data to produce these subsets?
The original purpose of validation sets (1) was, well, validation (or rather verification), i.e. measuring the generalization performance of the already trained model.
In that sense, yes, you do need a validation set. Note, though, that this validation set I'm talking about has a totally different purpose from @Jai's validation set (see below).
Cross validation (2) is one very widely applied scheme to split your data so as to generate pairs of training and validation sets. Alternatives range from other resampling techniques such as out-of-bootstrap validation, through single splits (hold-out), all the way to doing a separate performance study once the model is trained.
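To make these splitting schemes concrete, here is a minimal sketch (not from the original discussion; scikit-learn and the toy data are my own assumptions) of a single hold-out split versus k-fold generation of training/validation pairs:

```python
# Two ways of generating training/validation pairs from one data set.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # placeholder feature matrix
y = rng.integers(0, 2, size=100)     # placeholder binary labels

# Single hold-out split: one training/validation pair.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# k-fold cross validation: k training/validation pairs from the same data.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # ... fit a surrogate model on the training part, evaluate on the validation part
```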
At some point, the need arose to do some fine-tuning of hyperparameters. Unfortunately, instead of saying: fine, our new training algorithm internally performs an optimization of generalization error, and therefore we split the training set again into a hyperparameter-optimization set and a normal parameter-fit set, the former validation set was used for this optimization. Because that is really part of the training, another set was needed to estimate the final model's performance, i.e. a set that does what the validation set used to do. This needed another name, and became known as the test set.
In my experience this historic naming scheme train-validate-test creates a lot of confusion, particularly in fields where verification and validation were already established terminology for studying/demonstrating the predictive performance of methods.
Personally, I therefore prefer to speak
- either of training-optimization-verification, or
- of training and verification/validation, pointing out that inside your training you can do whatever further splits you like.
This point of view has the advantage that it is much easier to see which set of hyperparameters should be used when doing the final training with the whole data set.
Maybe this also helps to explain:
> why setting the hyperparameters to best fit the validation set is right, and doing that for the test set is wrong, if they both come from the same distribution? Both are the same way of cheating the way I see it.
The idea is that during training you are allowed (and supposed) to find out as much as possible about this distribution. Validation/verification then is to prove how much about this distribution was actually learned. And hyperparameter tuning really is part of the training.
Another analogy to the training-optimization-verification splitting is school: training is when a concept is explained to you. You then may do some practice exams to challenge and fine-tune your understanding of the concept. Finally, there is an exam to demonstrate the learned ability. Even if you do another round of fine-tuning your understanding after the exam, the mark is set. The same goes for a model, just that for many practically relevant situations we know there is a much higher danger of overfitting with our models, so we just don't accept any claim of improvement over the validation (exam) without proof (another validation, i.e. re-taking the exam).
Now for each of these splitting steps, you need to decide how to do this. Doing single splits leads to the fixed train-optimize-verify (aka train-validate-test) approach. Doing cross validation for both is called nested or double cross validation. Your intermittent cross validation corresponds to doing cross validation for the (train+optimize) vs. verify split, and a single split for train vs. optimize.
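As an illustration of the nested (double) cross validation variant, here is a minimal sketch assuming scikit-learn, with a placeholder SVM and a made-up parameter grid (neither comes from the question): the inner loop does the train vs. optimize split, the outer loop the (train+optimize) vs. verify split.

```python
# Nested (double) cross validation: hyperparameter optimization in the inner loop,
# verification of the whole training procedure in the outer loop.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))        # placeholder features
y = rng.integers(0, 2, size=120)     # placeholder binary labels

inner = KFold(n_splits=5, shuffle=True, random_state=1)   # train vs. optimize
outer = KFold(n_splits=5, shuffle=True, random_state=2)   # (train+optimize) vs. verify

# The "training procedure" = fitting plus internal hyperparameter optimization.
tuned_svm = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner)

# The outer cross validation verifies that whole procedure.
outer_scores = cross_val_score(tuned_svm, X, y, cv=outer)
print(outer_scores.mean(), outer_scores.std())
```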
> Would it be reasonable to think that they changed the hyperparameters in each of the 10 iterations (where at the same time they were also changing the training and validation data, since that is what K-fold cross validation does), and then they went with the set of hyperparameters that gave the best test accuracy during that process?
No, this is not a good idea: picking the hyperparameters that give the best test accuracy turns the test set into yet another optimization set.
A valid approach would be to optimize the training within each fold, and record the test results. This basically corresponds to a cross validation of a training procedure that internally does a single split into train and optimization data sets.
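A minimal sketch of that valid approach, again assuming scikit-learn with a placeholder SVM and a made-up hyperparameter grid: the outer cross validation tests a training procedure that internally does a single split into train and optimization sets.

```python
# Outer cross validation of a training procedure with a single internal
# train/optimization split for hyperparameter tuning.
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))        # placeholder features
y = rng.integers(0, 2, size=120)     # placeholder binary labels

outer = KFold(n_splits=10, shuffle=True, random_state=0)
test_scores = []
for trainopt_idx, test_idx in outer.split(X):
    # Single internal split: train set vs. optimization set.
    X_tr, X_opt, y_tr, y_opt = train_test_split(
        X[trainopt_idx], y[trainopt_idx], test_size=0.25, random_state=0)

    # Pick the hyperparameter that does best on the optimization set.
    best_C, best_score = None, -np.inf
    for C in [0.1, 1, 10]:
        score = SVC(C=C).fit(X_tr, y_tr).score(X_opt, y_opt)
        if score > best_score:
            best_C, best_score = C, score

    # Retrain on the whole (train+optimize) part and record the test result.
    model = SVC(C=best_C).fit(X[trainopt_idx], y[trainopt_idx])
    test_scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(test_scores))
```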
Best Answer
The underlying difficulty is that cross validation results (actually: all test results) are subject to several sources of variance (read the Dietterich and the Bengio & Grandvalet papers).
The usual tests the linked blog post discusses all assume that the data can be described using one variance term.
Sources of variance:
1. Variance due to the finite (small) number of test cases: for figures of merit that are proportions of tested cases (e.g. accuracy) we can actually estimate this variance based on the number of independent test cases and the observed proportion via the binomial distribution (see the small numeric sketch after this list).
2. Variance due to model instability, i.e. variation between the surrogate models. This can be instability originating from the particular cases that happen to constitute the training data (for discussing k-fold cross validation we'll further divide this below).
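Here is the small numeric sketch referred to in item 1; the numbers are made up purely for illustration:

```python
# Binomial-based uncertainty of an observed accuracy (illustrative numbers only).
n_test = 200    # number of independent test cases
p_hat = 0.85    # observed accuracy (proportion of correctly classified test cases)

var_p = p_hat * (1 - p_hat) / n_test   # binomial variance of the observed proportion
se_p = var_p ** 0.5                    # corresponding standard error
print(f"accuracy {p_hat:.2f} +- {se_p:.3f} (1 standard error, n = {n_test})")
```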
Which (part) of these sources of variance is relevant depends on what question is actually asked (Dietterich makes a nice point of this), or in other words, in which ways we want to generalize the findings.
Here are some scenarios:
(a) How well does the particular model obtained from the data set at hand predict unknown cases from the same population?
(b) How well do models trained on a data set of this size drawn from the same population predict unknown cases, i.e. generalizing also over the training data?
For answering (a), if we directly test the model in question with an independent test set (a verification/validation study), only variance source 1 is relevant: any instability-type variance is part of the performance of the model we actually examine.
So in that scenario, we can use e.g. a paired test (in case both models in question are tested with the same test cases). Which paired test to choose (McNemar vs. t-test vs. other tests) depends on the figure of merit we compare. McNemar for binary outcomes, t-test/z-test for approximately normally distributed figures of merit and so on.
Fortunately, we can estimate this variance as soon as we have sufficient test cases in our testing.
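For the paired binary-outcome case, here is a minimal sketch of McNemar's test, assuming statsmodels; the contingency counts are made up:

```python
# McNemar's test for two models tested on the same cases (binary correct/wrong outcomes).
# The counts below are made up for illustration.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: model A correct / wrong; columns: model B correct / wrong.
table = np.array([[70, 12],
                  [ 5, 13]])

result = mcnemar(table, exact=True)   # exact binomial test on the discordant pairs
print(result.statistic, result.pvalue)
```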
Still question (a): If we don't have independent test data at hand and go for resampling such as cross validation, that will be subject to some bias (depending on the learning curve of the models and the choice of $k$). Plus, instability starts to play a role: the surrogate models we actually test may vary around the average of the learning curve.
However, for the cross validation approximation of the figures of merit (still for the models we actually get from the data set at hand), only the instability that arises from training on a $1 - \frac{1}{k}$ subset of the data set at hand is relevant for the uncertainty of the performance of the model obtained from our data set.
This can be estimated e.g. from repeated/iterated k-fold cross validation or out-of-bootstrap and the like.
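A minimal sketch of this estimation, assuming scikit-learn and placeholder data and classifier (not from the original discussion): each case is predicted once per repetition, so the spread of those predictions across repetitions reflects variation between the surrogate models, i.e. instability.

```python
# Repeated k-fold to separate instability-type variance from case-to-case variance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # placeholder features
y = rng.integers(0, 2, size=100)     # placeholder binary labels

n_splits, n_repeats = 5, 20
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)

preds = np.full((n_repeats, len(y)), np.nan)
for i, (train_idx, test_idx) in enumerate(rkf.split(X)):
    rep = i // n_splits              # repetition this fold belongs to
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    preds[rep, test_idx] = model.predict(X[test_idx])

# Per-case variance across repetitions: nonzero values indicate instability of the
# surrogate models; the finite-test-set variance is not part of this quantity.
print(preds.var(axis=0).mean())
```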
Now if we want to generalize both to unknown cases and models that are trained on another data set (of same/similar size) obtained from the same population (question b), we need to know how representative our data set is for the underlying training population. I.e. how much variance in the models we'd get if trained on $n$ new cases. That's what Bengio & Grandvalet are concerned with and what they show cannot be estimated from a single data set. This is also what the 5x2-fold scheme tries to get at - but at the price of a) having substantially smaller training sets for the surrogate models, and b) still having correlation since for each surrogate model, only 1 other surrogate model is independent, the other 8 are correlated as they share cases.
So if you can show that the surrogate models are stable (see below), then you could approximately say that all variance comes from the finite number of cases tested and go for the pairwise test just as you'd do for the independent test set.
How to show stability:
- via repeated/iterated k-fold: each case is tested exactly once per repetition/iteration. Any variance in the predictions of the same test case must originate from variation between the surrogate models, i.e. instability.
See e.g. our paper: Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations Anal Bioanal Chem, 2008, 390, 1261-1271.
DOI: 10.1007/s00216-007-1818-6
Other resampling schemes (out-of-bootstrap etc.) work as well: as long as you have several predictions of the same test case, you can separate that variance from the case-to-case variance.
- without repeated/iterated k-fold: if the fitted parameters of the surrogate models are equal (or sufficiently similar), we also know that the models are stable. This is a stronger condition than stability of the predictions, and it'll need some work to establish what order of magnitude of variation is sufficiently small.
Practically speaking, I'd say this may be doable for (bi)linear models where we can directly study the fitted coefficients (see the sketch below), but will probably not be feasible for other types of models. (And in any case it may need more time than getting some further iterations of the k-fold while you personally work on other stuff.)
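As a sketch of that coefficient-comparison idea for a linear model (scikit-learn and toy data assumed; what counts as "sufficiently similar" still has to be justified for the application at hand):

```python
# Compare the fitted coefficients of the surrogate models of a k-fold
# cross validation as a (strong) stability check for a linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=100)

coefs = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    coefs.append(LinearRegression().fit(X[train_idx], y[train_idx]).coef_)
coefs = np.array(coefs)

print(coefs.mean(axis=0))   # average coefficients over the surrogate models
print(coefs.std(axis=0))    # small spread relative to the mean suggests stability
```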