I am referring to this extremely well-explained post: Link to the post
The confusion arises because after you have tuned your hyperparameters and found the best-performing classifier, be it a single model or a complicated ensemble, you find that hypothesis $g$ performs best during cross-validation. The idea is then to also use this $g$ to predict on X_test to see how well it does on unseen data.
My question is: out of all the models I trained, say $h_1, h_2, \dots$ from various hypothesis spaces (to put it simply, a linear model, a tree model, and an ensemble model like a random forest), I eventually found the random forest to do best on the validation set, so at last I will use this model to predict on X_test to assess it. Why can't I also run all 3 trained models on the test set after the dust has settled…? Is it because doing so can introduce "my own bias" once I realize that maybe the ensemble model didn't get the highest score on X_test?
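A minimal sketch of the workflow in question, assuming scikit-learn and a synthetic dataset; the variable names and split sizes are illustrative, not from the post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Three-way split: train for fitting, validation for model selection,
# test held out for a single final assessment.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "linear": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Select g: the hypothesis with the best validation accuracy.
val_scores = {name: model.fit(X_train, y_train).score(X_val, y_val)
              for name, model in candidates.items()}
best_name = max(val_scores, key=val_scores.get)

# Only the chosen g touches X_test, and only once.
test_score = candidates[best_name].score(X_test, y_test)
print(val_scores, best_name, test_score)
```

The temptation the question describes is to replace the last two lines with a loop over all three candidates on X_test.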
Best Answer
Any time you use a chunk of data to search through a space of possibilities (whether that is finding the best coefficients for a regression or choosing between an RNN and a CNN), you are by definition fitting to that data.
That's the usual set-up. However, as you said in your question, you might want to take a couple of models from validation (say the top 5) and see which one generalises best on the test set.
As I said above, you would then be fitting to the test data and introducing the chance that you pick a model which simply got lucky on it. In this case we would create a fourth data set: a final holdout that none of the selection steps ever touch, used only once to score the winning model.
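A hedged sketch of that four-way split, again assuming scikit-learn; the extra "holdout" split and all names are illustrative, not from the answer:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, random_state=0)

# Four-way split: train / validation / test / final holdout.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_rest, y_val, y_rest = train_test_split(X_rest, y_rest, test_size=0.6, random_state=0)
X_test, X_hold, y_test, y_hold = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

models = {
    "linear": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}
for model in models.values():
    model.fit(X_train, y_train)

# Stage 1: validation scores would normally shortlist the candidates
# (kept as all three here for brevity).
val_scores = {name: m.score(X_val, y_val) for name, m in models.items()}

# Stage 2: the test set chooses among the shortlisted models. Because it
# now drives a choice, it has become part of the selection procedure.
test_scores = {name: m.score(X_test, y_test) for name, m in models.items()}
winner = max(test_scores, key=test_scores.get)

# Stage 3: the untouched holdout gives the final, unbiased estimate.
print(winner, models[winner].score(X_hold, y_hold))
```

The design point is that each split is consumed by exactly one decision: once a data set has been used to choose anything, its score is optimistically biased, so a fresh, untouched set is needed for the final estimate.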