Solved – after using cross-validation, is a separate train-test split necessary for generating a model

cross-validation, train

I am going through the excellent book "Introduction to Machine Learning with Python," and reading about cross-validation. I can understand how it makes a more efficient use of the data than a typical train-test split, but the book also contains the caveat:

It is important to keep in mind that cross-validation is not a way to build a model that can be applied to new data. Cross-validation does not return a model… multiple models are built internally, but the purpose of cross-validation is only to evaluate how well a given algorithm will generalize when trained on a specific dataset.

So if cross-validation doesn't produce a model, does that mean that after performing cross-validation I need to build a model in the usual way, with a train/test split? If so, that would imply that my cross-validation scores would typically be higher than my final model's scores, since cross-validation makes more efficient use of the data.

Or is it held that after cross-validation, I can simply train my model on all of the data without any further test set? That would mean that I've never tested my actual model, which sounds wrong, but perhaps cross-validation is a valid test since every sample appears in both training and testing? If so, that implies that my cross-validation scores would typically be lower than my final model's scores, since only the final model would be trained on all of the samples.

Best Answer

Or is it held that after cross-validation, I can simply train my model on all of the data without any further test set?

Yes - cross-validation is a (more efficient) replacement for that test set.
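As a concrete illustration, here is a minimal scikit-learn sketch of that workflow (the dataset and estimator are arbitrary placeholders, not taken from the question): cross-validation supplies the performance estimate, and the model you actually keep is refit on all of the data.

```python
# Minimal sketch of "CV for the estimate, then refit on everything".
# Dataset and estimator are arbitrary placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Cross-validation plays the role of the test set: it estimates how well
# this algorithm generalizes when trained on a dataset of roughly this size.
scores = cross_val_score(model, X, y, cv=5)
print("estimated generalization accuracy: %.3f +/- %.3f"
      % (scores.mean(), scores.std()))

# The deployed model is then fit on all of the data; the CV scores above
# serve as its performance estimate.
final_model = model.fit(X, y)
```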

That would mean that I've never tested my actual model, which sounds wrong, but perhaps cross-validation is a valid test since every sample appears in both training and testing?

Cross-validation treats each of its training sets as a good approximation of the whole data set (as do other resampling validation schemes such as out-of-bootstrap), so the resulting estimate is approximately right.

There are numerous studies of the error you make with different validation schemes, considering the total error as systematic plus random error (bias plus variance). It turns out that, for small sample sizes (fewer than a few thousand independent cases), cross-validation is better than the alternative of a single train-test split, where - as you say - you have the advantage of an unbiased estimate but pay for it with much higher variance.
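To see what that trade-off means in practice, here is a rough toy sketch (not one of the studies referred to above; dataset and estimator are arbitrary stand-ins) that repeats both schemes with different random splits and compares the spread of the resulting score estimates:

```python
# Toy comparison of the spread (variance) of the two validation estimates.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

holdout_scores, cv_scores = [], []
for seed in range(30):
    # Single hold-out split: unbiased for a model trained on 75% of the data,
    # but the estimate depends strongly on which cases land in the test set.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    holdout_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

    # 5-fold CV, averaged over the folds: slightly pessimistic, lower variance.
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    cv_scores.append(cross_val_score(model, X, y, cv=folds).mean())

print("hold-out : mean %.3f, std %.3f"
      % (np.mean(holdout_scores), np.std(holdout_scores)))
print("5-fold CV: mean %.3f, std %.3f"
      % (np.mean(cv_scores), np.std(cv_scores)))
```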

If so, that implies that my cross-validation scores would typically be lower than my final model's scores, since only the final model would be trained on all of the samples.

Yes - cross-validation will have a slight pessimistic bias if done correctly (how large depends on the slope of the learning curve between the CV training sample size and the total sample size). You trade that bias for lower variance (which depends on the test sample size for the train-test split and on the total sample size for CV).
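If you want a feel for how large that pessimistic bias is in your particular case, a learning curve gives a hint (again just a sketch with a placeholder dataset): if the score is still climbing at the largest training size, CV on (k-1)/k of the data will noticeably underestimate a model trained on all of it; if the curve has flattened, the bias is negligible.

```python
# Sketch: use a learning curve to judge whether the score is still improving
# near the full sample size (placeholder dataset and estimator as before).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

train_sizes, _, test_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5))

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print("trained on %3d samples -> mean CV score %.3f" % (n, score))
```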