Machine Learning – Is Using Training and Test Sets for Hyperparameter Tuning Overfitting?

cross-validation, machine learning, overfitting

You have a training and a test set. You combine them and do something like GridSearch to decide the hyperparameters of the model. Then, you fit a model on the training set using these hyperparameters, and you use the test set to evaluate it.
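For concreteness, here is a sketch of that procedure with scikit-learn; the dataset, model (SVC), and parameter grid are placeholders, not part of the original setup:

    # Sketch of the procedure described above; dataset, model, and grid are placeholders.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Hyperparameters chosen on the *combined* training and test data.
    X_all = np.vstack([X_train, X_test])
    y_all = np.concatenate([y_train, y_test])
    search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
    search.fit(X_all, y_all)

    # Final model fit on the training set only, then evaluated on the test set.
    model = SVC(**search.best_params_).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))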

Is this overfitting? Ultimately, the model was not fitted on the test set, but the test set was considered when deciding the hyperparameters.

Best Answer

The idea behind holdout and cross validation is to estimate the generalization performance of a learning algorithm, that is, the expected performance on unknown/unseen data drawn from the same distribution as the training data. This estimate can be used to tune hyperparameters or to report the final performance. Its validity depends on the independence of the data used for training and the data used for estimating performance. If this independence is violated, the performance estimate will be overoptimistically biased. The most egregious way this can happen is by estimating performance on data that has already been used for training or hyperparameter tuning, but there are many more subtle and insidious ways too.
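As a minimal illustration of a holdout estimate (the dataset and model here are assumptions, not part of the answer):

    # Minimal holdout sketch: the held-out data plays no role in fitting.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Accuracy on data the model has never seen estimates generalization performance.
    print("holdout accuracy:", model.score(X_test, y_test))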

The procedure you asked about goes wrong in multiple ways. First, the same data is used for both training and hyperparameter tuning. The goal of hyperparameter tuning is to select hyperparameters that will give good generalization performance. Typically, this works by estimating the generalization performance for each candidate choice of hyperparameters (e.g. using a validation set) and then choosing the best. As above, this estimate will be overoptimistic if the same data has been used for training. The consequence is that sub-optimal hyperparameters will be chosen; in particular, there will be a bias toward high-capacity models that will overfit.
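A minimal sketch of tuning against a held-out validation set, so the estimate that drives the choice never sees the training data; the model (SVC), the candidate values of C, and the dataset are illustrative assumptions:

    # Choose a hyperparameter using a validation split that was not used for fitting.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    best_score, best_C = -1.0, None
    for C in [0.1, 1, 10]:
        # Fit on the training split; estimate generalization on the validation split.
        score = SVC(C=C).fit(X_train, y_train).score(X_val, y_val)
        if score > best_score:
            best_score, best_C = score, C

    print("chosen C:", best_C, "validation accuracy:", best_score)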

Second, data that has already been used to tune the hyperparameters is being re-used to estimate the final performance. This will again give an overoptimistically biased estimate, as above. That isn't overfitting in itself, but it means that, if overfitting is happening (and it probably is, as above), you won't know it.

The remedy is to use three separate datasets: a training set for training, a validation set for hyperparameter tuning, and a test set for estimating the final performance. Alternatively, use nested cross validation, which gives better estimates and is necessary when there isn't enough data to split three ways.
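A sketch of nested cross-validation with scikit-learn, assuming an SVC and a small grid purely for illustration: the inner search picks the hyperparameters, and the outer loop estimates the performance of the whole tuning-plus-training procedure on data the inner search never sees.

    # Nested cross-validation: inner loop tunes, outer loop estimates performance.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=0)

    inner_search = GridSearchCV(
        SVC(),
        param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
        cv=5,  # inner folds: hyperparameter selection only
    )

    # Outer folds: each held-out portion is never touched by the inner search.
    outer_scores = cross_val_score(inner_search, X, y, cv=5)
    print("nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))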