Cross-Validation – Is Cross-Validation with No Data Leakage Sufficient to Replace Train-Test Split?

cross-validation, train-test-split

I would like to seek expert advice on the topic above.

I was taught to follow this workflow (a rough code sketch follows the list):

  1. Split the dataset into training and testing sets
  2. Use the training dataset to develop the model
    • Set the model's hyperparameters
    • Do cross-validation (multiple train-validation splits, fits, and performance measurements)
    • Repeat with different hyperparameter sets until training performance is close to validation performance
  3. Thereafter, check the performance of the final chosen model on the testing dataset
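
For concreteness, here is a minimal sketch of what I mean, assuming scikit-learn; the dataset, estimator, and hyperparameter values below are only placeholders, not my actual setup:

```python
# Rough sketch of the workflow above (placeholder data, model, and grid).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# 1. Split once into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Cross-validate each hyperparameter candidate on the training data only
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"C={C}: mean CV accuracy = {cv_scores.mean():.3f}")

# 3. Fit the chosen hyperparameter on the full training set,
#    then check the final model once on the testing set
final_model = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))
```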

My thoughts are:

  1. The testing dataset is chosen only once
  2. It might not be representative
  3. Hence the measured performance might not be representative either, be it good or bad

My questions are:

  1. Given this situation, would the results from cross-validation be a good gauge of model performance?
  2. If so, then the train-test split might not be needed
  3. Of course, during cross-validation it is important not to have any data leakage

I would love to hear your thoughts. Thank you very much.

Best Answer

The confusion might stem from a clash of workflows. Your workflow includes the step "Repeat with different hyperparameter sets until training performance is close to validation performance," whereas the typical workflow that I am familiar with would include the step "Repeat with different hyperparameter sets and select the one that performs best on the validation data."

In the latter case, a train-test split is needed for a fair evaluation of the selected model's performance on new data. In the cross-validation stage with hyperparameter tuning, the performance of the best model on the hold-out folds is optimistically biased. This is because selecting the best tuning parameter involves some luck in addition to skill, and that luck would not be replicated on an independent test set. You can see this for yourself using a simulation.
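
For instance, here is a minimal simulation sketch (the pure-noise data and the k-NN tuning below are my own illustrative assumptions): every candidate model's true accuracy is 0.5, yet the cross-validated score of the selected hyperparameter comes out above 0.5 on average, while the same model's score on an independent test set does not.

```python
# Simulation sketch: labels are pure noise, so every model's true accuracy is 0.5.
# Selecting the hyperparameter with the best CV score inflates that CV score,
# but not the score on an independent test set.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
best_cv, test_of_best = [], []

for rep in range(200):
    X = rng.normal(size=(120, 5))
    y = rng.integers(0, 2, size=120)   # labels independent of the features
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.33, random_state=rep
    )

    # "Tune" the number of neighbours by cross-validation on the training data
    scores = {
        k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
        for k in (1, 3, 5, 7, 9)
    }
    k_best = max(scores, key=scores.get)

    best_cv.append(scores[k_best])
    test_of_best.append(
        KNeighborsClassifier(n_neighbors=k_best).fit(X_tr, y_tr).score(X_te, y_te)
    )

print("mean CV score of selected model  :", np.mean(best_cv))       # clearly above 0.5
print("mean test score of selected model:", np.mean(test_of_best))  # close to 0.5
```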

If you do not need a fair evaluation of the selected model's performance on new data, then you can skip the train-test split and tune and select the model using cross-validation alone. This is an efficient use of the data for that purpose.
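
As a sketch of this CV-only route, assuming scikit-learn's GridSearchCV with a placeholder dataset and grid: the search tunes and refits on all of the data, and the best CV score it reports is exactly the optimistically biased number discussed above.

```python
# CV-only route: tune, select, and refit on all of the data (no test split).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    refit=True,   # refit the selected model on all of the data
)
search.fit(X, y)

print("selected C:", search.best_params_["C"])
# best_score_ is the CV score of the *selected* model, hence optimistically
# biased as an estimate of performance on new data.
print("CV score of selected model (biased):", search.best_score_)
```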

With your workflow, I am first not sure whether it is likely to produce the optimal hyperparameters, and second I am not sure what to expect from the selected model when it is applied to new data. Again, simulations could help provide some insight. There does not seem to be a built-in mechanism that would produce overly optimistic performance on the validation data, so perhaps the train-test split is unnecessary. But that is just a guess.

Another point: if your sample is representative, it is unlikely that a randomly chosen test set is not representative. If your randomly chosen test set is not representative, then it is likely that your whole sample is not representative, so not splitting into training and test will not help.