Cross-Validation – Is Hold-Out Validation a Better Approximation of ‘Getting New Data’ Than K-Fold CV?


I've been rethinking an answer I gave to a question a couple of weeks ago.

Hold-out cross-validation produces a single test set that can be used repeatedly for demonstration. We all seem to agree that this is in many ways a negative feature, since the one held-out set could turn out to be non-representative through randomness. Moreover, you could end up overfitting to the test data in the same way you can overfit to the training data.

However, it seems to me that the static nature of a held-out sample is a better approximation of "getting more data" than k-fold CV, and avoids the issue of averaging across folds. I can't, however, come up with any statistical basis for this feeling I have. Is there any logic in my intuition?

For instance, what I have in mind for an upcoming project is first using hold-out validation to build and test a model, then as a validation step re-drawing the hold-out set several times to show that my estimates of prediction error (on the test set) are robust to sampling error in the test set. Is this a bad idea for any reason? This question was asked before but never received an answer.
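To make the plan concrete, here is a minimal sketch of what I have in mind, using scikit-learn, a synthetic data set, and a logistic regression model (all of these are illustrative assumptions, not part of the actual project):

```python
# Step 1: a single hold-out split to build and test the model.
# Step 2: re-draw the hold-out split several times (repeated hold-out via
# ShuffleSplit) to see how much the error estimate moves around.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("single hold-out accuracy:", accuracy_score(y_te, model.predict(X_te)))

scores = []
for train_idx, test_idx in ShuffleSplit(n_splits=20, test_size=0.25,
                                        random_state=1).split(X):
    m = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], m.predict(X[test_idx])))
print("repeated hold-out: mean %.3f, sd %.3f" % (np.mean(scores), np.std(scores)))
```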

Best Answer

IMHO one of the worst properties of hold-out validation is psychological rather than statistical: I see a lot of hold-out results being interpreted as if they came from an independent validation experiment (with independence already at the experimental level), although many of the crucial problems I see with resampling validation can and will happen just the same with hold-out (any problem that arises from improper splitting).

Other than that, IMHO it is almost the same as resampling validation (at least as I've seen it done in practice). The differences are:

  • The total number of actually tested distinct cases is lower (and consequently the estimate is less certain).
  • With hold-out, the performance is claimed for the actually tested model, not for the actually untested model built from the hold-out training data plus the hold-out test data. Resampling claims that the measured performance is a good approximation to the performance of that latter model. But I've also seen the hold-out approach used this way ("set validation"). A small sketch of this distinction follows below.
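The following hedged sketch (scikit-learn, synthetic data, an arbitrary classifier - all my assumptions) illustrates the second point: resampling measures surrogate models, while the number is usually quoted for the final model refit on all data, which is itself never tested directly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)  # placeholder data
est = LogisticRegression(max_iter=1000)

# k-fold CV: each score belongs to a surrogate model trained on k-1 folds...
cv_scores = cross_val_score(est, X, y, cv=5)
print("surrogate-model scores:", cv_scores)

# ...while the model one actually uses is typically refit on *all* the data;
# its performance is never measured directly, only approximated by the above.
final_model = est.fit(X, y)
```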

Esbensen & Geladi (2010), "Principles of Proper Validation: use and abuse of re-sampling for validation", Journal of Chemometrics, 24(3-4), 168-187, argue that, in practical terms, neither is a very good approximation of data sets (validation experiments) that would allow one to measure the really interesting performance characteristics.

you could end up overfitting to the test data in the same way you can overfit to the training data.

Same as with any other validation: if you do data-driven modeling/model selection, another independent level of validation is needed. I don't see any difference here between hold-out and resampling schemes.
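A hedged sketch of what "another independent level of validation" can look like when the outer level is a hold-out set (scikit-learn, synthetic data, and an SVM grid are illustrative assumptions): all data-driven selection happens on the inner (CV) level, and the outer test set is touched exactly once, after selection is finished.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)  # placeholder data
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25,
                                                random_state=0)

# Inner level: data-driven model selection by cross-validation on the dev set.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X_dev, y_dev)

# Outer level: one evaluation of the selected model on data it has never seen.
print("selected C:", search.best_params_["C"])
print("outer hold-out accuracy:", search.score(X_test, y_test))
```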

first using hold-out validation to build and test a model, then as a validation step re-drawing the hold-out set several times to show that my estimates of prediction error (on the test set) are robust to sampling error in the test set. Is this a bad idea for any reason?

I think so, yes: IMHO a nested set-up should be used (unless you want to suggest that the hold-out validation could and should be repeated as well - that is a valid approach, which differs from iterated/repeated set validation only in interpretation: whether the performance statement is about the many actually tested models or whether it is extrapolated to the one model built from all the data).
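A minimal sketch of such a nested set-up, assuming scikit-learn, synthetic data, and an SVM grid (all illustrative): the inner CV does the data-driven selection, while the outer CV estimates the performance of that whole selection procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)  # placeholder data

inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)   # inner: model selection
outer_scores = cross_val_score(inner, X, y, cv=5)         # outer: validation
print("nested CV accuracy: %.3f +/- %.3f"
      % (outer_scores.mean(), outer_scores.std()))
```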
