Do I stick with the tuned model parameters even if they produce worse test scores?

cross-validation, parameterization

The shorter and more general version of this question:
If tuning a model via cross-validation (within the training set) produces worse results on the test set than my previous default/baseline model, do I stick with the tuned model (to avoid overfitting on the training data) or do I go back to the baseline (even though it seems like I'm overfitting by doing so)?

The longer, more detailed scenario:
Let's say I have an 80/20 train/test split. I build a model with the default hyperparameters and obtain an F1 score of 0.35 on the test set.

Then I use cross-validation on the training set to identify the best hyperparameters, and build a new model on all of the training data using the hyperparameters that cross-validation found optimal. However, when I evaluate this "optimal" tuned model on the test set, I get an F1 score of 0.23.

In such cases, should I stick with the default hyperparameters that produced the higher F1 score on the test set, or stick with the tuned model because it was tuned using cross-validation?

If such variation in the numbers is unlikely, then I guess I'm wondering whether some other factor may be at play, such as too small a dataset (e.g. around 1000 datapoints in total, with fewer than 200 of them in the class of interest) or imbalanced fold partitioning.

Best Answer

You can use the defaults. Of course, it's possible to overfit to the test set by trying lots of different hyperparameter values and seeing what test-set performance each one leads to. But if all you're trying is two options, the default values and the tuned values, and the default values weren't themselves chosen using the test set, this kind of overfitting is not a substantial danger.
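To illustrate why comparing only two options on the test set is much less risky than comparing many, here is a rough simulation sketch. It makes several assumptions that are not from the question: every candidate model has the same true accuracy (0.7), the test set has 200 points, and accuracy stands in for F1 to keep the simulation simple. The function name `mean_selection_optimism` is made up for this example. It measures how far the best observed test score sits above the true score, on average, when you pick the winner among `n_candidates` models.

```python
import random


def mean_selection_optimism(n_candidates, n_test=200, true_acc=0.7,
                            n_trials=300, seed=0):
    """Average amount by which the best observed test accuracy among
    `n_candidates` equally good models exceeds their shared true accuracy.

    All defaults are illustrative assumptions, not values from a real study.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        best = 0.0
        for _ in range(n_candidates):
            # Observed accuracy: each of the n_test predictions is correct
            # independently with probability true_acc.
            correct = sum(rng.random() < true_acc for _ in range(n_test))
            best = max(best, correct / n_test)
        # How much the winner's observed score overstates its true score.
        total += best - true_acc
    return total / n_trials


if __name__ == "__main__":
    for k in (1, 2, 50):
        print(k, round(mean_selection_optimism(k), 3))
```

Under these assumptions the optimism should be close to zero for one candidate, small for two, and noticeably larger for fifty, which is the sense in which choosing between just the default and the tuned model does little harm.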

Note that when tuning hyperparameters worsens test-set performance but improves training-set performance, that's a hint that the tuning procedure is overfitting on the training set. Using regularization or switching to a simpler model may help.
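Before concluding that the tuning overfit, it may also be worth checking whether the gap between the two test scores is larger than what a test set of roughly 200 points would produce by chance, which is the small-dataset concern raised in the question. The sketch below is one way to do that with a paired bootstrap over test examples. The helper names `f1_score` and `bootstrap_f1_diff` are made up for this example (this `f1_score` is a plain-Python stand-in, not a library function), labels are assumed to be 0/1 with 1 as the class of interest, and the percentile interval is only approximate.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_diff(y_true, pred_a, pred_b, n_boot=1000, seed=0):
    """Approximate 95% percentile interval for F1(pred_a) - F1(pred_b).

    Test examples are resampled with replacement, and both models are
    scored on the same resample so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        f1_a = f1_score(yt, [pred_a[i] for i in idx])
        f1_b = f1_score(yt, [pred_b[i] for i in idx])
        diffs.append(f1_a - f1_b)
    diffs.sort()
    # Integer arithmetic avoids floating-point surprises in the indices.
    lo = diffs[(n_boot * 25) // 1000]
    hi = diffs[(n_boot * 975) // 1000 - 1]
    return lo, hi
```

Here `pred_a` would be the default model's test predictions and `pred_b` the tuned model's. If the resulting interval comfortably includes 0, the observed difference may be mostly noise rather than evidence that either model is better.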
