I've been rethinking an answer I gave to a question a couple of weeks ago.
Hold-out validation produces a single test set that can be used repeatedly for demonstration. We all seem to agree that this is in many ways a negative feature, since the one held-out set could turn out to be non-representative purely by chance. Moreover, you could end up overfitting to the test data in the same way you can overfit to the training data.
However, it seems to me that the static nature of a held-out sample is a better approximation of "getting more data" than k-fold CV is, and it avoids the issue of averaging estimates across folds. I can't, however, come up with any statistical basis for this feeling. Is there any logic in my intuition?
For instance, what I have in mind for an upcoming project is to first use hold-out validation to build and test a model, and then, as a validation step, to re-draw the hold-out set several times to show that my estimates of prediction error on the test set are robust to sampling error in that set. Is this a bad idea for any reason? This question has been asked before but never received an answer.
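For concreteness, here is a minimal sketch of that re-drawing idea. Everything in it is invented for illustration (the synthetic data, the trivial one-parameter model, and the 25% split); it only shows the mechanics, not a recommendation of the procedure:

```python
import random
import statistics

random.seed(0)

# Toy data: y = 2*x + noise (invented purely for illustration)
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]

def fit_slope(train):
    """Least-squares slope through the origin: a deliberately simple model."""
    return sum(x * y for x, y in train) / sum(x * x for x, y in train)

def holdout_mse(data, test_frac=0.25, seed=None):
    """One hold-out split: fit on the training part, score on the held-out part."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, train = shuffled[:n_test], shuffled[n_test:]
    b = fit_slope(train)
    return sum((y - b * x) ** 2 for x, y in test) / len(test)

# Re-draw the hold-out set several times and look at the spread of the estimate
estimates = [holdout_mse(data, seed=s) for s in range(20)]
print(f"mean test MSE: {statistics.mean(estimates):.3f}")
print(f"between-split sd: {statistics.stdev(estimates):.3f}")
```

The spread across re-drawn splits is exactly the sampling error in the test set worried about above: if it is small relative to the mean, the original single-split estimate was not badly split-dependent.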
Best Answer
IMHO, one of the worst properties of hold-out validation is psychological rather than statistical: I see a lot of hold-out validation interpreted as if it were an independent validation experiment (with independence already at the experimental level), although many of the crucial problems I see with resampling validation can and will happen just the same with hold-out (any problem that arises from improper splitting).
Other than that, IMHO it is almost the same as resampling (at least as I've seen it done in practice), though there are differences.
Esbensen and Geladi (Principles of Proper Validation: use and abuse of re-sampling for validation, Journal of Chemometrics, 24 (3-4), 168-187) argue that, in practical terms, neither is a very good approximation of data sets (validation experiments) that would allow one to measure the really interesting performance characteristics.
Same as with any other validation: if you do data-driven modeling/model selection, another independent level of validation is needed. I don't see any difference here between hold-out and resampling schemes.
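As a sketch of what that extra independent level looks like with a plain three-way split: the synthetic data, the split sizes, and the toy shrinkage "hyperparameter" below are all invented for illustration.

```python
import random

random.seed(2)

# Toy data: y = 2*x + noise (invented for illustration)
xs = [random.uniform(0, 10) for _ in range(300)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]
random.shuffle(data)

# Three disjoint sets: fitting, model selection, and a final untouched test set
train, select, test = data[:180], data[180:240], data[240:]

def mse(slope, pts):
    return sum((y - slope * x) ** 2 for x, y in pts) / len(pts)

def fit_slope(train, shrink):
    """Least-squares slope through the origin, scaled by a toy hyperparameter."""
    raw = sum(x * y for x, y in train) / sum(x * x for x, y in train)
    return raw * shrink

# Data-driven model selection consumes the `select` set ...
candidates = {s: fit_slope(train, s) for s in (0.5, 0.8, 1.0, 1.2)}
best = min(candidates, key=lambda s: mse(candidates[s], select))
# ... so the performance estimate must come from the still-independent `test` set
final_estimate = mse(candidates[best], test)
print(f"selected shrinkage {best}, test MSE {final_estimate:.3f}")
```

The same logic applies whether the inner level is a single `select` set (hold-out) or a resampling scheme; only the outer, untouched level yields an honest estimate after selection.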
I think so, yes: IMHO a nested set-up should be used
(unless you want to suggest that hold-out validation could and should be repeated as well; that is a valid approach which differs from iterated/repeated set validation only in interpretation: whether the performance statement is about the many actually tested models, or whether it is extrapolated to the one model built on all of the data).
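To make that interpretation point concrete, here is a small sketch of repeated (iterated) hold-out; the data and the one-parameter model are invented for illustration:

```python
import random

random.seed(1)

# Toy data: y = 3*x + noise (invented for illustration)
xs = [random.uniform(0, 5) for _ in range(150)]
data = [(x, 3 * x + random.gauss(0, 1)) for x in xs]

def fit_slope(train):
    """Least-squares slope through the origin."""
    return sum(x * y for x, y in train) / sum(x * x for x, y in train)

# Each repetition tests a different surrogate model on a different split
surrogate_errors = []
for rep in range(30):
    rng = random.Random(rep)
    shuffled = data[:]
    rng.shuffle(shuffled)
    test, train = shuffled[:30], shuffled[30:]
    b = fit_slope(train)  # surrogate model: trained on this split only
    surrogate_errors.append(sum((y - b * x) ** 2 for x, y in test) / len(test))

avg_error = sum(surrogate_errors) / len(surrogate_errors)
final_model = fit_slope(data)  # the one model built on ALL the data
# `avg_error` literally describes the 30 surrogate models; quoting it for
# `final_model` is the extrapolation step mentioned above.
print(f"averaged surrogate test error: {avg_error:.3f}")
print(f"slope of final model on all data: {final_model:.3f}")
```

The averaged error is a statement about the surrogate models; attributing it to the final model fitted on all the data is the (usually reasonable, but distinct) extrapolation.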