Is it good practice to use K-fold cross-validation instead of a training, validation and test set?

cross-validation, machine-learning

I have a dataset of 5,000 samples for a regression problem. With this number of samples, can I use K-fold cross-validation as an alternative to a separate validation set, and is it better to do so?

If not: I've been splitting the data 70% / 15% / 15%, but the splits don't seem to follow exactly the same distribution, and depending on the split I get very different results. For example, in one split I tune the model on the validation set and get an $R^2$ of 0.75 on the validation set and then an $R^2$ of 0.88 on the test set, and in another instance it's the exact opposite.

If I do a few random splits and find some for which the fitted model structure is not very different, can I just use one of those splits? Wouldn't that defeat the purpose of randomizing the splits? Out of these splits I chose two at random; after training, tuning and all the other steps, one split gives me the following $R^2$ for the train, dev and test sets: 0.91, 0.92, 0.90, and the other gives 0.93, 0.88, 0.85. Can I use the first one, which gave me the best result? Is that valid? It seems my final test score would be totally different depending on the split.

Best Answer

Your experience with split-dependent differences in your modeling and performance estimates is a reason why cross validation is often preferred in all but extremely large data sets. Repeated cross validation or bootstrapping might be even better.

With the separate training/validation/test approach, your estimate of the generalizability of your model's performance on new data comes from the test set, which is set aside until initial training and tuning are done on the training and validation sets. Say that you want to estimate the error in model predictions made on your test set. You are then trying to estimate a variance, which can require a surprisingly large number of cases to estimate precisely.

But if you set aside more cases for the test set so that you have a better measure of generalizability, you have fewer cases available for training and validation, which may limit your ability to generate a useful model in the first place.
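To get a feel for the scale of this split-to-split variability, here is a minimal sketch on synthetic data (scikit-learn is assumed to be available; the sample size, noise level and ridge model are arbitrary choices for illustration, not your actual setup). The same 70/15/15 split is repeated with different random seeds and the held-out test $R^2$ is recorded each time:

```python
# Minimal sketch: how much the held-out test R^2 moves around across
# random 70/15/15 splits of a 5000-sample synthetic regression problem.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=20, noise=30.0, random_state=0)

test_scores = []
for seed in range(20):
    # 70% train, 15% validation, 15% test
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.15, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=seed)

    # Tuning on (X_val, y_val) is omitted here for brevity; a fixed model
    # is enough to show the spread of the test-set estimate.
    model = Ridge(alpha=1.0).fit(X_train, y_train)
    test_scores.append(r2_score(y_test, model.predict(X_test)))

print(f"test R^2 over 20 splits: mean={np.mean(test_scores):.3f}, "
      f"sd={np.std(test_scores):.3f}, "
      f"range=[{min(test_scores):.3f}, {max(test_scores):.3f}]")
```

The spread you see in that range is exactly the kind of split-dependent variation described in the question, and it only grows as the test fraction shrinks or the noise increases.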

The chapter on cross validation and related methods in The Elements of Statistical Learning (2nd edition, p. 222) puts it like this:

The methods in this chapter are designed for situations where there is insufficient data to split it into three parts. Again it is too difficult to give a general rule on how much training data is enough; among other things, this depends on the signal-to-noise ratio of the underlying function, and the complexity of the models being fit to the data.

So cross validation is a useful approach in cases where you don't have a "large enough" data set to accomplish your goals. Your question suggests that you might be in such a situation despite having 5000 cases.
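As a point of comparison, here is a minimal K-fold sketch on the same kind of synthetic data (again assuming scikit-learn; the 10-fold choice and ridge model are illustrative). Every case is used for both fitting and evaluation, and the fold-to-fold spread gives a direct sense of how uncertain the estimate is:

```python
# Minimal sketch: 10-fold cross-validated R^2 on a synthetic regression set.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=5000, n_features=20, noise=30.0, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, scoring="r2", cv=cv)
print(f"10-fold R^2: mean={scores.mean():.3f}, sd={scores.std():.3f}")
```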

In practice, a single run of cross validation can give imprecise results. Frank Harrell recommends repeated runs of cross validation or, better, bootstrapping to take advantage of all the data most efficiently in building and evaluating a model. See for example this answer, with a link to further reading in a comment. His rms package provides tools for building, validating, and calibrating models.
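The two alternatives mentioned above can be sketched in a few lines as well (scikit-learn and NumPy assumed). The bootstrap loop below simply refits on each resample and scores on the cases left out of that resample; it is not Harrell's optimism-corrected bootstrap as implemented in the rms package, only an illustration of the resampling idea:

```python
# Minimal sketch: repeated K-fold CV and a simple out-of-bag bootstrap.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=5000, n_features=20, noise=30.0, random_state=0)

# Repeated K-fold: 10 folds, repeated 5 times with different shuffles.
rcv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
rep_scores = cross_val_score(Ridge(alpha=1.0), X, y, scoring="r2", cv=rcv)
print(f"10x5 repeated CV R^2: mean={rep_scores.mean():.3f}, "
      f"sd={rep_scores.std():.3f}")

# Bootstrap: resample with replacement, refit, score on the left-out cases.
rng = np.random.default_rng(0)
n = len(y)
oob_scores = []
for _ in range(100):
    boot = rng.integers(0, n, size=n)        # indices drawn with replacement
    oob = np.setdiff1d(np.arange(n), boot)   # cases not drawn this round
    model = Ridge(alpha=1.0).fit(X[boot], y[boot])
    oob_scores.append(r2_score(y[oob], model.predict(X[oob])))
print(f"bootstrap out-of-bag R^2: mean={np.mean(oob_scores):.3f}, "
      f"sd={np.std(oob_scores):.3f}")
```

Averaging over repeats or resamples is what makes these estimates more stable than any single split, which is the point of Harrell's recommendation.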
