I've been rethinking an answer I gave to a question a couple of weeks ago.
Hold-out validation produces a single test set that can be used repeatedly for demonstration. We all seem to agree that this is in many ways a negative feature, since the one held-out set could turn out to be non-representative purely by chance. Moreover, you could end up overfitting to the test data in the same way you can overfit to the training data.
However, it seems to me that the static nature of a held-out sample is a better approximation of "getting more data" than k-fold CV is, and it avoids the issue of averaging estimates across folds. I can't, however, come up with any statistical basis for this feeling. Is there any logic in my intuition?
For instance, what I have in mind for an upcoming project is to first use hold-out validation to build and test a model, and then, as a validation step, to re-draw the hold-out set several times to show that my estimates of prediction error on the test set are robust to sampling error in that set. Is this a bad idea for any reason? This question has been asked before but never received an answer.
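For concreteness, here is a minimal sketch of that re-drawing idea. Everything in it is invented for illustration (the synthetic data, the trivial one-parameter model, and the 25% split); it only shows the mechanics, not a recommendation of the procedure:

```python
import random
import statistics

random.seed(0)

# Toy data: y = 2*x + noise (invented purely for illustration)
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]

def fit_slope(train):
    """Least-squares slope through the origin: a deliberately simple model."""
    return sum(x * y for x, y in train) / sum(x * x for x, y in train)

def holdout_mse(data, test_frac=0.25, seed=None):
    """One hold-out split: fit on the training part, score on the held-out part."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, train = shuffled[:n_test], shuffled[n_test:]
    b = fit_slope(train)
    return sum((y - b * x) ** 2 for x, y in test) / len(test)

# Re-draw the hold-out set several times and look at the spread of the estimate
estimates = [holdout_mse(data, seed=s) for s in range(20)]
print(f"mean test MSE: {statistics.mean(estimates):.3f}")
print(f"between-split sd: {statistics.stdev(estimates):.3f}")
```

The spread across re-drawn splits is exactly the sampling error in the test set worried about above: if it is small relative to the mean, the original single-split estimate was not badly split-dependent.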
Best Answer
IMHO, one of the worst properties of hold-out validation is psychological rather than statistical: I see a lot of hold-out validation interpreted as if it were an independent validation experiment (with independence already at the experimental level), although many of the crucial problems I see with resampling validation can and will happen just the same with hold-out (any problem that arises from improper splitting).
Other than that, IMHO it is almost the same as resampling (at least as I've seen it done in practice), though there are differences.
Esbensen and Geladi (Principles of Proper Validation: use and abuse of re-sampling for validation, Journal of Chemometrics, 24 (3-4), 168-187) argue that, in practical terms, neither is a very good approximation of data sets (validation experiments) that would allow one to measure the really interesting performance characteristics.
Same as with any other validation: if you do data-driven modeling/model selection, another independent level of validation is needed. I don't see any difference here between hold-out and resampling schemes.
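As a sketch of what that extra independent level looks like with a plain three-way split: the synthetic data, the split sizes, and the toy shrinkage "hyperparameter" below are all invented for illustration.

```python
import random

random.seed(2)

# Toy data: y = 2*x + noise (invented for illustration)
xs = [random.uniform(0, 10) for _ in range(300)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]
random.shuffle(data)

# Three disjoint sets: fitting, model selection, and a final untouched test set
train, select, test = data[:180], data[180:240], data[240:]

def mse(slope, pts):
    return sum((y - slope * x) ** 2 for x, y in pts) / len(pts)

def fit_slope(train, shrink):
    """Least-squares slope through the origin, scaled by a toy hyperparameter."""
    raw = sum(x * y for x, y in train) / sum(x * x for x, y in train)
    return raw * shrink

# Data-driven model selection consumes the `select` set ...
candidates = {s: fit_slope(train, s) for s in (0.5, 0.8, 1.0, 1.2)}
best = min(candidates, key=lambda s: mse(candidates[s], select))
# ... so the performance estimate must come from the still-independent `test` set
final_estimate = mse(candidates[best], test)
print(f"selected shrinkage {best}, test MSE {final_estimate:.3f}")
```

The same logic applies whether the inner level is a single `select` set (hold-out) or a resampling scheme; only the outer, untouched level yields an honest estimate after selection.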
I think so, yes: IMHO a nested set-up should be used
(unless you want to suggest that hold-out validation could and should be repeated as well; that is a valid approach which differs from iterated/repeated set validation only in interpretation: whether the performance statement is about the many actually tested models, or whether it is extrapolated to the one model built on all of the data).
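To make that interpretation point concrete, here is a small sketch of repeated (iterated) hold-out; the data and the one-parameter model are invented for illustration:

```python
import random

random.seed(1)

# Toy data: y = 3*x + noise (invented for illustration)
xs = [random.uniform(0, 5) for _ in range(150)]
data = [(x, 3 * x + random.gauss(0, 1)) for x in xs]

def fit_slope(train):
    """Least-squares slope through the origin."""
    return sum(x * y for x, y in train) / sum(x * x for x, y in train)

# Each repetition tests a different surrogate model on a different split
surrogate_errors = []
for rep in range(30):
    rng = random.Random(rep)
    shuffled = data[:]
    rng.shuffle(shuffled)
    test, train = shuffled[:30], shuffled[30:]
    b = fit_slope(train)  # surrogate model: trained on this split only
    surrogate_errors.append(sum((y - b * x) ** 2 for x, y in test) / len(test))

avg_error = sum(surrogate_errors) / len(surrogate_errors)
final_model = fit_slope(data)  # the one model built on ALL the data
# `avg_error` literally describes the 30 surrogate models; quoting it for
# `final_model` is the extrapolation step mentioned above.
print(f"averaged surrogate test error: {avg_error:.3f}")
print(f"slope of final model on all data: {final_model:.3f}")
```

The averaged error is a statement about the surrogate models; attributing it to the final model fitted on all the data is the (usually reasonable, but distinct) extrapolation.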