In the K-Fold method, do we still hold out a test set for the very end, and only use the remaining data for training and hyperparameter tuning (i.e., we split the remaining data into k folds, and then use the average accuracy over the held-out folds (or whatever performance metric we choose) to tune our hyperparameters)?
Yes. As a rule, the test set should never be used to change your model (e.g., its hyperparameters).
However, cross-validation can sometimes be used for purposes other than hyperparameter tuning, e.g. determining to what extent the train/test split impacts the results.
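As an illustration, here is a minimal sketch of that workflow using scikit-learn (the iris data, the SVC model and the tiny C grid are just placeholders I picked for the example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# hold out a test set that is never used for tuning
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# k-fold cross validation on the remaining data drives hyperparameter selection
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_trainval, y_trainval)

# only the final, tuned model gets to see the test set, exactly once
print("mean CV accuracy of the selected C:", search.best_score_)
print("test accuracy:", search.score(X_test, y_test))
```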
Let me chime in from a different point of view:
"Cross validation" and "validation set" are concepts that are orthogonal/independent in the sense that:
1. Validation set is about the question: how many/which separate data subsets do I need?
2. Cross validation is one possible answer to the question: how do I generate/split my data to produce these subsets?
The original purpose of validation sets (1.) was, well, validation (or rather verification), i.e. measuring the generalization performance of the already trained model.
In that sense, yes, you do need a validation set. Note, though, that this validation set I'm talking about has a totally different purpose from @Jai's validation set (see below).
Cross validation (2.) is one very widely applied scheme to split your data so as to generate pairs of training and validation sets. Alternatives range from other resampling techniques such as out-of-bootstrap validation, through single splits (hold-out), all the way to doing a separate performance study once the model is trained.
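To make the "cross validation is just a splitting scheme" point concrete, here is a small sketch using scikit-learn's KFold purely as an index generator (the toy data are mine):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy "samples"

# cross validation, seen purely as a generator of (training, validation) index pairs
for fold, (train_idx, val_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(X)):
    print(f"fold {fold}: train on {train_idx}, validate on {val_idx}")

# an out-of-bootstrap scheme would instead draw the training indices with
# replacement and validate on the samples that were never drawn; a hold-out
# scheme would yield just a single such pair
```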
At some point, there arose the need to do some fine-tuning of hyperparameters. Unfortunately, instead of saying: fine, our new training algorithm internally does an optimization on generalization error, and therefore we split the training set again into a hyperparameter optimization set and a normal parameter fit set, the former validation set was used for that optimization. Because that is really part of the training, another set to estimate the final model's performance was needed, i.e. a set that does what the validation set used to do. This needed another name, and became known as the test set.
In my experience this historic naming scheme train-validate-test creates a lot of confusion, particularly in fields where verification and validation were already established terminology for studying/demonstrating the predictive performance of methods.
Personally, I therefore prefer to speak
- either of training-optimization-verification, or
- of training and verification/validation, pointing out that inside your training you can do whatever further splits you like.
This point of view has the advantage that it is much easier to see which set of hyperparameters should be used when doing the final training with the whole data set.
Maybe this also helps to explain:
why setting the hyperparameters to best fit the validation set is right, while doing that for the test set is wrong, if they both come from the same distribution? Both are the same way of cheating, the way I see it.
The idea is that during training you are allowed (and supposed) to find out as much as possible about this distribution. Validation/verification then is to prove how much about this distribution was actually learned. And hyperparameter tuning really is part of the training.
Another analogy to the training-optimization-verification splitting is school: training is when a concept is explained to you. You then may do some practice exams to challenge and fine-tune your understanding of the concept. Finally, there is an exam to demonstrate the learned ability. Even if you do another round of fine-tuning of your understanding after the exam, the mark is set. The same holds for a model, just that we know for many practically relevant situations that there is a much higher danger of overfitting with our models, so we just don't accept any claim of improvement over the validation (exam) without proof (another validation, i.e. re-taking the exam).
Now for each of these splitting steps, you need to decide how to do the split. Doing single splits leads to the fixed train-optimize-verify (aka train-validate-test) approach. Doing cross validation for both is called nested or double cross validation. Your intermediate cross validation corresponds to doing cross validation for the (train + optimize) vs. verify split, and a single split for train vs. optimize.
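A minimal sketch of nested (double) cross validation, assuming scikit-learn and a placeholder model/grid: the inner cross validation does the train vs. optimize split, the outer one does the (train + optimize) vs. verify split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# inner loop: the train vs. optimize split, done by 5-fold cross validation
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

# outer loop: the (train + optimize) vs. verify split, also by cross validation
outer_scores = cross_val_score(inner, X, y, cv=5)
print("verification accuracy per outer fold:", outer_scores)
```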
Would it be reasonable to think that they changed the hyperparameters in each of the 10 iterations (where at the same time they were also changing the training and validation data, since that is what K-fold cross validation does), and then they went with the set of hyperparameters that gave the best test accuracy during that process?
No, this is not a good idea.
A valid approach would be to optimize the training within each fold and record the test results. This basically corresponds to a cross validation of a training procedure that internally does a single split into training and optimization data sets.
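For illustration, a sketch of that variant (placeholder data, model and C grid; the variable names are mine): the outer loop is a cross validation, while each fold internally uses one single split into a fit set and an optimization set:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
outer_scores = []

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # within each outer fold: a single split into fit and optimization sets
    X_fit, X_opt, y_fit, y_opt = train_test_split(
        X[train_idx], y[train_idx], test_size=0.25, random_state=0)

    # "optimize training": pick the C that does best on the optimization set
    best_C = max([0.1, 1.0, 10.0],
                 key=lambda C: SVC(C=C).fit(X_fit, y_fit).score(X_opt, y_opt))

    # retrain with that C on the whole outer training part and record the
    # result on the outer test fold
    model = SVC(C=best_C).fit(X[train_idx], y[train_idx])
    outer_scores.append(model.score(X[test_idx], y[test_idx]))

print("test accuracy per outer fold:", np.round(outer_scores, 3))
```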
You are right in that, when no hyperparameters are tuned, a single split into training and testing is all you usually do for an internal* generalization error estimate.
Validation is, however, a somewhat ambiguous term here (see here for my take on the historical reasons). Do not confuse not having (or not seeing) the middle data set of the famous train/validation/test split with the need for verification and validation of the model in the engineering (or application field) sense of the word. That latter need is not touched at all by the way you organize your model training.
* This "internal" refers to the fact that training and test data are produced by splitting one larger data set, i.e. they come from the same lab/data source. This again is more the engineering terminology.
There is nothing inherently bad in evaluating them multiple times. The trouble arises from then selecting, say, the hyperparameter set with the best of those estimates. This again is not wrong in itself, as long as any further conclusions or actions take this into account. But not taking this into account can lead to serious overestimation of the generalization error estimate's quality.
No, we're one step further here in our considerations:
When selecting hyperparameters based on the validation set (aka inner test set aka development set aka optimization set) error estimate, the validation set becomes part of the training of the final model.
The risk of overfitting during the hyperparameter estimation increases, among other factors, with the variance (uncertainty) of the error estimate used to guide the model selection. This is where k-fold is better than a single split, since more cases tested means lower uncertainty due to the finite number of tested cases.
Another important factor is the number of hyperparameter sets you select from (the size of your search space).
From a stats point of view, selecting the best hyperparameter set is a multiple comparison situation, and the more comparisons and the more variance there is on the performance estimates, the larger the risk of selecting a model that only accidentally seemed to be better. This is what overfitting to the validation set means.
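A toy simulation (my own illustration, not from the original answer) of that multiple-comparison effect: many candidate "hyperparameter sets" that are all pure noise, scored on the same small validation set; the one selected for its best validation accuracy looks far better than chance:

```python
import numpy as np

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=50)        # a small validation set with binary labels

# 200 candidate "hyperparameter sets" whose predictions are pure guessing
n_candidates = 200
scores = [(rng.integers(0, 2, size=50) == y_val).mean()
          for _ in range(n_candidates)]

# chance level is 0.5, yet the candidate selected by "best validation accuracy"
# looks considerably better than that
print("best validation accuracy among the candidates:", max(scores))
```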