Solved – Why will the validation set error underestimate the generalisation error

classification, distributions, hypothesis testing, machine learning, variance

In my book about machine learning the concept of a validation set is introduced. It's a subset of the training set that is used to "train" the hyperparameters. More specifically, the validation set is used to estimate the generalisation error during or after training, allowing the hyperparameters to be updated accordingly.

The book states that since the validation set is used to train the hyperparameters, the validation set will underestimate the generalisation error.

Question: Why? I understand that this is true most of the time, but why would it be true in general?

Thanks in advance!

Best Answer

Say the data is generated by some underlying distribution $f$. We want to learn a model that performs well on future data generated by the same distribution. The true generalization performance of a model is the expected value of the error over $f$. Unfortunately, we only have access to a finite dataset sampled from $f$, which must be used both to train the model (including its hyperparameters) and to estimate generalization performance. Finite samples are variable: if we were to draw multiple datasets from $f$, each one would be different, and none would perfectly represent the underlying distribution.
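Here's a minimal sketch (my own toy setup, not from the book) of that variability: the same fixed classifier, evaluated on several independent samples drawn from one distribution $f$, gives error estimates that scatter around its expected error over $f$ (which is 0.25 in this particular construction).

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_dataset(n):
    # f: x ~ N(0, 1); label is 1 when x plus independent standard-normal noise is positive.
    x = rng.normal(size=n)
    y = (x + rng.normal(size=n) > 0).astype(int)
    return x, y

def predict(x):
    # One fixed model: predict 1 whenever x > 0.
    return (x > 0).astype(int)

# Estimate the same model's error on 10 independent samples of size 200 from f.
estimates = []
for _ in range(10):
    x, y = draw_dataset(200)
    estimates.append(np.mean(predict(x) != y))

print(np.round(estimates, 3))
# Each estimate differs from the expected error over f (0.25 here):
# no single finite sample perfectly represents the underlying distribution.
```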

Suppose we split the dataset into a training set (used to train the model parameters) and a validation set (used to select the hyperparameters). By tuning the hyperparameters on the validation set, we're selecting a model (out of multiple possibilities) that performs best on that particular sample. In that sense, this operation is not fundamentally different from choosing the regular parameters using the training set. Just as it's possible to overfit the training set, it's possible to overfit the validation set. Because samples are variable, a model may have low error on the validation set because it happens to be a good match for the particular, random values in that sample, rather than because it truly matches the underlying distribution $f$. In that case, the model's error on the validation set will be lower than its expected error over $f$. The chance of finding such a model increases as the number of models we select from grows, and as the size of the validation set shrinks.
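A quick illustration of this selection effect (again my own toy example, with pure-noise labels and candidate "models" that just guess at random, so every candidate's true error is exactly 0.5): picking the candidate with the lowest validation error yields a validation error well below 0.5, and the gap widens with more candidates or a smaller validation set.

```python
import numpy as np

rng = np.random.default_rng(1)
n_val, n_models = 50, 200              # small validation set, many candidates

# Pure-noise binary labels: no candidate can truly beat 50% error.
y_val = rng.integers(0, 2, size=n_val)

# Each candidate model makes independent random predictions on the validation set.
val_errors = [np.mean(rng.integers(0, 2, size=n_val) != y_val)
              for _ in range(n_models)]

print(f"validation error of the selected (best) candidate: {min(val_errors):.2f}")
print("true expected error of every candidate:             0.50")
# Typically the selected candidate's validation error is around 0.30-0.35,
# an optimistic underestimate of the true 0.5.
```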

The remedy to this issue is to estimate generalization performance on an independent subset of the data that has not affected the model in any way (including through the regular parameters, the hyperparameters, or even preprocessing and other decisions made by the analyst). For example, the data can be split into independent training, validation, and test sets, or nested cross-validation can be used.
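Continuing the same toy setup (pure-noise labels, random-guessing candidates; my own sketch, not a prescribed implementation), a held-out test set that plays no role in selection reports an error close to the true 0.5, while the validation error of the selected model remains optimistically low.

```python
import numpy as np

rng = np.random.default_rng(2)
n_val, n_test, n_models = 50, 10_000, 200

y_val = rng.integers(0, 2, size=n_val)    # used to select the model
y_test = rng.integers(0, 2, size=n_test)  # never touched during selection

# Each candidate is a fixed random guesser; store its predictions on both sets.
candidates = [(rng.integers(0, 2, size=n_val), rng.integers(0, 2, size=n_test))
              for _ in range(n_models)]

# Select the candidate with the lowest validation error.
val_pred, test_pred = min(candidates, key=lambda c: np.mean(c[0] != y_val))

print(f"validation error of the selected model: {np.mean(val_pred != y_val):.2f}")
print(f"test error of the same model:           {np.mean(test_pred != y_test):.2f}")
# The validation error is biased downward by the selection, whereas the
# independent test set gives an estimate near the true error of 0.5.
```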

For more information about these issues, see:

Cawley, G. C. and Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11:2079–2107.