Solved – Should cross-validation for (hyper)parameter tuning be performed on the validation set or the training set?

cross-validation, hyperparameter, machine learning, overfitting, tuning

I am learning about the use of cross-validation with grid search to choose the best hyperparameters for an SVM. The problem I came across is that the references and examples of its application do not follow a single standard.

On the one hand, I have seen resources describing the following steps (a code sketch follows the list):

  • 1a) Split the data into training and test sets (say 50:50),
  • 1b) Run cross-validation with grid search on the training set only, and identify the hyperparameter set that gives the best performance,
  • 1c) Use the best hyperparameter set to train on the training set,
  • 1d) Lastly, use the trained model (with the best hyperparameter set) to make predictions on the test set, and evaluate its performance there.
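To make approach 1 concrete, here is a minimal sketch assuming scikit-learn; the synthetic data from `make_classification` and the grid values are placeholders, not part of the original question:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data

# 1a) 50:50 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# 1b) cross-validated grid search on the training set only
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# 1c) with refit=True (the default), the best hyperparameter set is
#     automatically used to retrain on the whole training set
# 1d) evaluate that model once on the held-out test set
print(search.best_params_, search.score(X_test, y_test))
```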

Another way is the following (again, a sketch follows the list):

  • 2a) Split the data into validation, training, and test sets (say 20:40:40, respectively),
  • 2b) Run cross-validation with grid search on the validation set only, and identify the hyperparameter set that gives the best performance,
  • 2c) Use the best hyperparameter set to train on the training set,
  • 2d) Lastly, use the trained model (with the best hyperparameter set) to make predictions on the test set, and evaluate its performance there.
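The same sketch adapted to approach 2, again assuming scikit-learn with placeholder data and grid values; the 20:40:40 split is produced with two calls to `train_test_split`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data

# 2a) 20:40:40 split into validation, training, and test sets
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_train, y_val, y_train = train_test_split(
    X_rest, y_rest, train_size=1 / 3, random_state=0)        # 1/3 of the remaining 60% = 20%

# 2b) cross-validated grid search on the validation set only
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_val, y_val)

# 2c) retrain on the training set with the best hyperparameters
final_model = SVC(**search.best_params_).fit(X_train, y_train)

# 2d) evaluate once on the held-out test set
print(search.best_params_, final_model.score(X_test, y_test))
```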

Is approach 2 preferable to approach 1, or are both accepted in research and academic settings? Approach 1 seems better than approach 2 because it does not require setting data aside in a separate validation set that the final SVM will never be trained on. On the other hand, I have seen people argue that approach 2 is more scientifically sound because it is less prone to overfitting the training data. But the potential issue I see is that a small validation set may make the chosen hyperparameter set unreliable, while a large validation set wastes a lot of valuable data. Which should be used? Or does it depend?

Best Answer

I believe you are looking for a hard rule stating how the data should be split. That is really your call. Approach 1 is the most widely used, but a better split would be 50% for training, 30% for validation, and 20% for testing (a minimal sketch of that split follows below). One thing to keep in mind is that your data should usually be class-balanced for better tuning results. Also, you may get different tuning results if you shuffle your data differently, so make sure to account for that as well.
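For concreteness, a minimal sketch of that 50/30/20 split with scikit-learn; the stratification, synthetic data, and random seed are my assumptions, added to illustrate the class-balance and shuffling points above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)  # placeholder data

# Hold out 20% for testing; shuffle and stratify to keep the class balance in each split.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, stratify=y, random_state=0)

# Of the remaining 80%, 0.375 corresponds to 30% of the full data for validation, 50% for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.375, shuffle=True, stratify=y_tmp, random_state=0)

for name, part in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(name, len(part))   # roughly 500 / 300 / 200

# Re-running with a different random_state shows how shuffling can change the splits
# and, downstream, the tuning results.
```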

This data split has given me good results in most of my experiments. I would vote for Approach #1.