Cross-Validation – Understanding Cross-Validation and Test Set

cross-validation, machine learning, model selection

Apologies if this has been answered before elsewhere. Answers I have read so far have only confused me further.
Essentially, I want to check whether I can use the test set to choose between two different models (say an SVR and a random forest regressor), after I have tuned their optimal parameters through cross-validation.

Here's my workflow:

  • I have divided my dataset into a training and a test set.
  • I use $k$-fold cross-validation on the training set to select the model's best hyperparameters (i.e. those that minimise the CV error), for example via a grid search over the max_depth of a random forest regressor.
  • Once the hyperparameters have been chosen, I fit the corresponding model on the whole training set.
  • I can then evaluate its performance on the test set (a rough code sketch of this workflow is below).
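For concreteness, here is a minimal sketch of that workflow, assuming scikit-learn, a regression dataset already loaded as `X`, `y`, and a placeholder parameter grid (none of these names come from my actual project):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Step 1: hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: k-fold CV on the training set to tune hyperparameters.
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [2, 5, 10, None]},
    cv=5,
    scoring="neg_mean_squared_error",
)
rf_search.fit(X_train, y_train)

# Step 3: GridSearchCV refits the best configuration on the whole training set.
best_rf = rf_search.best_estimator_

# Step 4: evaluate that single fitted model on the test set.
test_mse = mean_squared_error(y_test, best_rf.predict(X_test))
```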

Now I want to choose between the SVR and the random forest regressor.

  • Do I compare their performance on the test set and choose the one with the lowest error? In doing so, am I not contaminating the design of the model with knowledge of the test set?
  • If the above is not possible because the test set is supposed to be treated as unseen data, do I then choose the tuned model that had the lowest CV-error between the two? In that case, what's the point of having the test set at all and am I not wasting valuable data by setting it aside?

Thanks

Best Answer

You are correct: choosing between two algorithms based on their test-set evaluations contaminates your test set. What you should do instead is select the best-performing algorithm during the CV stage and evaluate only that algorithm on the test set. Think of it this way: choosing between a random forest with tree depth x and one with tree depth y during CV is no different from choosing between a random forest and an SVR. Treat the algorithm selection as part of the hyperparameter tuning.
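As a sketch of what treating algorithm selection as part of the tuning could look like (continuing the scikit-learn example from the question; the grids and scorer are placeholders I chose for illustration), you compare the candidates on their CV scores alone and let only the winner touch the test set:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Tune each candidate algorithm with CV on the training set only.
candidates = {
    "rf": GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"max_depth": [2, 5, 10, None]},
        cv=5, scoring="neg_mean_squared_error",
    ),
    "svr": GridSearchCV(
        SVR(),
        {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
        cv=5, scoring="neg_mean_squared_error",
    ),
}
for search in candidates.values():
    search.fit(X_train, y_train)

# Pick the algorithm with the best CV score -- no test data involved yet.
best_name = max(candidates, key=lambda name: candidates[name].best_score_)
winner = candidates[best_name].best_estimator_

# Only the single chosen model ever sees the test set.
test_mse = mean_squared_error(y_test, winner.predict(X_test))
```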

It is also important to note that the test set is never wasted data. After the generalization error has been estimated on the test set, the appropriate next step is to recombine all of your data and perform CV on the whole dataset to select your hyperparameters/algorithm again. In this second iteration, ignore the error rates you find, as you already estimated the generalization error in the previous step.
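A sketch of that last step, continuing the example above: rerun the same CV-based selection on the full dataset and keep only the fitted model, discarding the new CV error estimate in favour of the test-set estimate already obtained.

```python
import numpy as np

# Recombine the training and test data.
X_all = np.concatenate([X_train, X_test])
y_all = np.concatenate([y_train, y_test])

# Re-run the same hyperparameter/algorithm selection on all of the data.
# The CV scores from this pass are ignored: the generalization error was
# already estimated on the test set in the previous step.
for search in candidates.values():
    search.fit(X_all, y_all)
final_name = max(candidates, key=lambda name: candidates[name].best_score_)
final_model = candidates[final_name].best_estimator_  # model to deploy
```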
