Machine Learning – Predicting on Test Set Using All Models Including Suboptimal Ones

cross-validation, machine-learning

I refer to this extremely well-explained post (link to the post).

My confusion arises at this point: after you have tuned your hyperparameters and found the best-performing classifier, be it a single model or a convoluted ensemble, you know that this hypothesis $g$ performs best during cross-validation. The idea is then to also use this $g$ to predict on X_test to see how well it does on unseen data.

My question concerns all the models I trained, say $h_1, h_2, …$, from various hypothesis spaces. To put it simply: I trained a linear model, a tree model, and an ensemble model such as a random forest. Eventually I found that the random forest did best on the validation set, so I will use that model to predict on X_test to assess it. Why can't I also evaluate all 3 trained models on the test set after the dust has settled? Is it because doing so can introduce "my own bias" once I realize that the ensemble model may not have the highest score on X_test?

Best Answer

Any time you use a chunk of data to search through a space of possibilities (whether that’s finding the best coefficients for a regression or choosing between an RNN and a CNN), you are by definition fitting to that data.

  • Training data: Used to fit a particular model.
  • Validation data: Used to search through the space of models to find the best hyperparameters.
  • Testing data: Used to check whether your best model from validation performs well out of sample (OOS).
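As a rough illustration of these three roles, here is a minimal sketch in plain Python. The synthetic data, the 1-D k-nearest-neighbours regressor, the candidate values of k, and the 180/60/60 split are all assumptions made for the example; none of them come from the question. The point is only the order of operations: fit on training data, choose on validation data, then touch the test data once.

```python
import random

def make_data(n, rng):
    """Noisy samples of y = x^2 on [-1, 1] (synthetic, for illustration only)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, x * x + rng.gauss(0, 0.1)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model (built from train) on data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
data = make_data(300, rng)  # samples are i.i.d., so slicing gives a random split
train, valid, test = data[:180], data[180:240], data[240:]

# Validation data: search over the hyperparameter k.
candidate_ks = [1, 3, 5, 10, 20, 40]
valid_scores = {k: mse(train, valid, k) for k in candidate_ks}
best_k = min(valid_scores, key=valid_scores.get)

# Testing data: used once, only for the model chosen on validation.
test_score = mse(train, test, best_k)
print(f"best k on validation: {best_k}, test MSE: {test_score:.4f}")
```

Note that the test set plays no part in choosing `best_k`; it only reports how the already-chosen model does.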

That’s the usual setup. However, as you said in your question, you might want to take a couple of models from validation (say the top 5) and see which one generalises best on the test data.

As I said above, you would then be fitting to the test data and introducing the chance that you picked a model which simply got lucky, so its test score would no longer be an honest estimate of OOS performance. In this case we would create a fourth data set:

  • Holdout data: Kept completely separate and used for the sole purpose of assessing the OOS performance of a potential production model. Only one model should get to see this data.
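To see why choosing on the test set "gets lucky", here is a small simulation sketch. It assumes five hypothetical candidate models that all have the same true accuracy (0.7) and test and holdout sets of 100 examples each; these numbers are invented for the illustration. Since the candidates are equally good, any difference in their test scores is pure noise, yet the winner's test score still looks better than it really is.

```python
import random

def observed_accuracy(true_acc, n, rng):
    """Accuracy measured on n independent examples for a model with the given true accuracy."""
    return sum(rng.random() < true_acc for _ in range(n)) / n

def simulate(n_models=5, true_acc=0.7, n_test=100, n_holdout=100,
             n_trials=2000, seed=0):
    rng = random.Random(seed)
    test_total = holdout_total = 0.0
    for _ in range(n_trials):
        # Every candidate is equally good; only sampling noise separates them.
        test_scores = [observed_accuracy(true_acc, n_test, rng)
                       for _ in range(n_models)]
        # Score of the "winner" on the same test set that was used to pick it.
        test_total += max(test_scores)
        # The winner is then scored once on fresh holdout data.
        holdout_total += observed_accuracy(true_acc, n_holdout, rng)
    return test_total / n_trials, holdout_total / n_trials

winner_test, winner_holdout = simulate()
print(f"winner on test set: {winner_test:.3f}, "
      f"same winner on holdout: {winner_holdout:.3f}")
```

Under these assumptions the winner's average test score comes out above the true 0.7, while its average holdout score stays close to 0.7, because the holdout data played no part in the selection.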