Solved – Should I further tune the model based on results on the test set or not?

cross-validation, machine-learning

I understand that we need to split our data into training, validation, and test sets: we use the training set to train the model, use cross-validation on the validation set to tune it, and finally use the set-aside test set, which the model has never seen, to get an honest estimate of its generalization performance on unseen data.
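For concreteness, here is a minimal sketch of that workflow, assuming Python with scikit-learn; the synthetic data, the SVC model, and the parameter grid are placeholders chosen only for illustration, not something from the question itself.

    # Sketch: split off a test set, tune by cross-validation, evaluate once.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=0)

    # Set the test set aside first; it is never touched during tuning.
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Tune hyperparameters by cross-validation on the remaining (dev) data.
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
    search.fit(X_dev, y_dev)

    # A single evaluation on the untouched test set gives the honest estimate.
    print("best CV score :", search.best_score_)
    print("test-set score:", search.score(X_test, y_test))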

However, I am not sure whether we should further tune the model after getting a result on the test set; in particular, whether to optimize the model based on that test-set result.

My understanding is that if a certain parameter setting seems to lift the test-set performance, and we therefore adopt that setting over the previous one, this is "data leakage": we are giving the model knowledge of the test data, which results in overfitting.

On the other hand, if we don't do anything, then it makes no sense to use the test set more than once; we would use it just one single time at the very end of model building and evaluation. But what if the performance on the test set is really bad; do we not go back to the model and try other parameter combinations not already used in cross-validation? If we do, it seems we come back to the question in the previous paragraph and we are overfitting… I'm really struggling with this process.

Also, these two posts seem to suggest different solutions. The first post indicates we could further tune based on the test result, while the second post clearly says we should do nothing! But again, if we do nothing, that implies we can only use the test set ONE TIME…

Can someone please help me clarify these concerns? Thanks in advance!

Best Answer

what if the performance on the test set is really bad; do we not go back to the model and try other parameter combinations not already used in cross-validation?

Of course you'll do that. But the point is that any given data set can be used in only one way: either for measuring generalization performance or for fitting the model (and tuning is really nothing else but part of the training process).
So once you start using the next data set for training (regardless of whether you call it tuning, fine-tuning, or anything else: if it influences the model, I'll call it training, and training thus includes selecting appropriate hyperparameters), you'll need a still-unknown data set to measure the generalization performance of the new model, or you need to state clearly that you don't have any such data and that the reported performance is subject to an optimistic bias.
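To illustrate the point, here is a small sketch (with placeholder data and models, not taken from the answer itself): as soon as the "test" set is used to choose between candidate models, that choice is part of training, and the same set no longer yields an unbiased estimate for the selected model.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=600, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1)

    # Several candidate models, fitted on the training data only.
    candidates = [LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
                  for c in (0.01, 0.1, 1.0, 10.0)]

    # Choosing the winner by its test-set score means the test set has now
    # influenced the model, i.e. it has effectively been used for training ...
    best = max(candidates, key=lambda m: m.score(X_test, y_test))

    # ... so this score is optimistically biased; an honest estimate of the
    # selected model requires data not used anywhere so far.
    print("'test' score of the model selected on the test set:",
          best.score(X_test, y_test))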


Whether the latter is a viable option IMHO depends a lot on the actual situation, i.e. on how much bias is expected: did you pick the best of 10,000 models on the basis of 25 cases, or did you "only" check whether PCA regularization helps, find no difference, and report that? As a scientific reviewer, I often put more emphasis on whether the authors are aware of possible overfitting issues and limit their conclusions accordingly than on requiring an absolutely independent test at a stage where only a few example cases are available, which anyway do not cover half of the relevant confounders.

You could, e.g., argue that you are in a hypothesis-generating rather than a hypothesis-testing stage. In medical research, you'd maybe do a lot of tuning on incrementally growing data sets, and once you are convinced of your model, you'll anyway have to go and get funding for a double-blinded validation study. In such a situation, it would be a complete waste of resources to demand a fully grown validation at every step.
BUT: this holds only as long as you are aware of the risk of overfitting, avoid it as much as possible, limit your conclusions, AND report everything you have done to your data (search term: data dredging).

OTOH, if this is the classifier you're starting to sell as a pedestrian-recognizing brake assistant for automated driving, not having a proper validation study is not an option...


update:

Your choice is basically between

  • a model with a properly estimated generalization performance, which turned out to be too bad to be useful, or
  • a possibly better model whose generalization performance you do not know, because that model was generated/tuned also using the "test" set, which can therefore no longer be used for a proper estimate of generalization performance.

In order to properly estimate the generalization performance of the tuned model, you need to obtain a new test set that is independent of all data used so far.
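A minimal sketch of that last step (again with synthetic placeholder data standing in for the old splits and for the newly collected, independent data; the hyperparameters are assumed to be the outcome of the earlier tuning):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=2)

    # "used" stands for everything touched so far (old train + validation +
    # the old "test" set, which by now has influenced the tuning);
    # "fresh" stands for data that played no role in any earlier step.
    X_used, X_fresh, y_used, y_fresh = train_test_split(
        X, y, test_size=0.2, random_state=2)

    # Hypothetical hyperparameters, assumed to be the result of the tuning.
    final_model = SVC(C=10, gamma=0.01).fit(X_used, y_used)

    # Only this score is a proper estimate of the tuned model's generalization.
    print("generalization estimate on fresh data:",
          final_model.score(X_fresh, y_fresh))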
