Clarification about Cross Validation and Test Set

Tags: cross-validation, validation

I was reading some questions and answers about the reasons for, and differences between, splitting a dataset into Training, Validation and Test sets.
I came up with a few questions whose answers I'm not completely sure about, and as a running example I will use a polynomial regressor with D = degree of the polynomial.

  1. Validation Set is only necessary when we have hyperparameters in our model, otherwise validation is useless. Suppose that I can only train a quadratic polynomial, i.e. D = 2; I cannot choose D = 1, D = 3 or any other value. What's the point here in splitting off part of the Training Set (and so having less data to train my model), evaluating the model on the Validation Set and then evaluating it again, but this time on the Test Set? It seems like we are doing the same thing twice in a row. I only see the sense in using a Validation Set when we need to tune the hyperparameters of the model: make a lot of "pre-tests", one for each value of D, choose the best one, and in the end test our final model (suppose D = 3) using the Test Set (see the sketch after this list). Does this make sense?

  2. Why is it bad to use the Training and Test Set multiple times?
    With a validation set we should train and validate for each value of D, and finally test our final model (again, suppose D = 3) using the Test Set. Why is it a bad idea to instead use the whole Training Set (no split into a Validation Set) and the Test Set for every value of D?
    For every value of D we would train on the whole Training Set and then test on the Test Set. What is the Validation Set going to tell us that the Test Set cannot?

  3. Reason for K-fold Cross Validation
    Among the reasons for using K-fold Cross Validation instead of a simple validation split there is that, if the Validation Set is not big enough, we risk overfitting the Validation Set. Shouldn't it be the Training Set that we risk overfitting? We are using the Validation Set only for evaluation, not for training, so why the risk of overfitting it instead of the Training Set?
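
To make sure we are talking about the same procedure, here is a rough sketch of the validation-based workflow I have in mind (just an illustration with scikit-learn and made-up data; the dataset and the range of D are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=2.0, size=300)

# Train / Validation / Test split (roughly 60 / 20 / 20).
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# "Pre-test" every candidate degree D on the Validation Set ...
val_mse = {}
for D in range(1, 10):
    model = make_pipeline(PolynomialFeatures(D), LinearRegression()).fit(X_train, y_train)
    val_mse[D] = mean_squared_error(y_val, model.predict(X_val))
best_D = min(val_mse, key=val_mse.get)

# ... and only the final, chosen model ever sees the Test Set.
final = make_pipeline(PolynomialFeatures(best_D), LinearRegression()).fit(X_train, y_train)
print("chosen D:", best_D, "test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```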

Best Answer

Validation Set is only necessary when we have hyperparameters in our model, otherwise validation is useless.

You are right in that, when no hyperparameters are tuned, a single split into training and testing is all you usually do for an internal* generalization error estimate.
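
For instance, a minimal sketch (assuming scikit-learn and some synthetic data, with the degree genuinely fixed at D = 2) would just be:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 - 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(scale=1.0, size=200)

# The degree is fixed, there is nothing to tune -> one train/test split is enough.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_train, y_train)
print("internal generalization error estimate (MSE):",
      mean_squared_error(y_test, model.predict(X_test)))
```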

Validation is, however, a somewhat ambiguous term here (see here for my take on the historical reasons). Do not confuse not having (or not seeing) the middle data set of the famous train/validation/test split with the need for verification and validation of the model in the engineering (or application field) sense of the word. That latter need is not touched at all by the way you organize your model training.

* "internal" refers to the fact that training and test data are produced by splitting one larger data set, i.e. they come from the same lab or data source. This again is more the engineering terminology.

Why is it bad to use the Training and Test Set multiple times?

There is nothing inherently bad in evaluating them multiple times. The trouble arises from

  • multiple use of such evaluations. In particular, any test data whose results are used to steer decisions such as model selection becomes part of the training procedure of the selected model and is thus not an independent test result any more. Hence the need for another, (outer) independent test set (see the nested-CV sketch after this list).
  • However often you evaluate a data set, you cannot get around the fact that it contains only so many independent cases.
    This again is not wrong in itself, as long as any further conclusions or actions take this into account. But not taking this into account can lead to serious overestimation of the quality of the generalization error estimate.
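
One way to express this in code is nested cross-validation, which plays the same role as a separate outer test set: the inner loop is allowed to steer the choice of D and is therefore part of training, while the outer loop only evaluates. This is a sketch under the assumption of a scikit-learn pipeline and synthetic data, not anything prescribed by the question:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=2.0, size=200)

pipe = Pipeline([("poly", PolynomialFeatures()), ("lin", LinearRegression())])
param_grid = {"poly__degree": list(range(1, 10))}

# Inner loop: this data steers the choice of D, so it is part of training.
inner = GridSearchCV(pipe, param_grid, cv=KFold(5, shuffle=True, random_state=2),
                     scoring="neg_mean_squared_error")

# Outer loop: data the selection never touched -> an independent test result.
outer_scores = cross_val_score(inner, X, y, cv=KFold(5, shuffle=True, random_state=3),
                               scoring="neg_mean_squared_error")
print("outer (independent) MSE estimate:", -outer_scores.mean())
```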

What is the Validation Set going to tell us that the Test Set cannot?

  • nothing, as long as there is no model selection involved.
  • as soon as there is model selection involved, the test set tells you whether this selection procedure caused overfitting to the validation set.

Among the reasons for using K-fold Cross Validation instead of a simple validation split there is that, if the Validation Set is not big enough, we risk overfitting the Validation Set. Shouldn't it be the Training Set that we risk overfitting?

no, we're one step further here in our considerations:

We are using the Validation Set only for evaluation, not for training, so why the risk of overfitting it instead of the Training Set?

When selecting hyperparameters based on the validation set (aka inner test set aka development set aka optimization set) error estimate, the validation set becomes part of the training of the final model.

The risk of overfitting during the hyperparameter estimation increases, among other factors, with the variance (uncertainty) of the error estimate used to guide the model selection. This is where k-fold is better than a single split, since testing more cases means lower uncertainty due to the finite number of cases tested.
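
If you want to see the variance argument numerically, here is a small simulation sketch (my own illustration, with synthetic data and an arbitrary fixed model, not anything from the question): it repeats both estimators of the generalization error, a single 80/20 hold-out and 5-fold cross validation, many times and compares how much the estimates scatter.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def one_replicate(seed, n=150, degree=3):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3, 3, size=(n, 1))
    y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=2.0, size=n)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

    # (a) single 80/20 hold-out: only 30 of the 150 cases ever get tested
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=seed)
    single = mean_squared_error(y_va, model.fit(X_tr, y_tr).predict(X_va))

    # (b) 5-fold CV: every one of the 150 cases gets tested exactly once
    kfold = -cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=seed),
                             scoring="neg_mean_squared_error").mean()
    return single, kfold

results = np.array([one_replicate(s) for s in range(200)])
print("spread of single-split estimates:", results[:, 0].std())
print("spread of 5-fold CV estimates:   ", results[:, 1].std())
```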

Another important factor is the number of hyperparameter sets you select from (the size of your search space).

From a stats point of view, selecting the best hyperparameter set is a multiple-comparison situation, and the more comparisons and the more variance there is on the performance estimates, the larger the risk of selecting a model that only accidentally seemed to be better. This is what overfitting to the validation set means.
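
To see what "accidentally seemed to be better" looks like, here is a deliberately extreme simulation sketch (again my own illustration): 100 candidate "models" that all have a true accuracy of exactly 0.5 are compared on a small validation set. The winner's validation score is systematically optimistic, and only an untouched test set reveals that.

```python
import numpy as np

rng = np.random.default_rng(3)
n_val, n_test, n_candidates = 50, 10_000, 100

# Binary labels; every candidate just guesses at random, so each one's
# true accuracy is exactly 0.5 -- there is nothing real to select.
y_val = rng.integers(0, 2, size=n_val)
y_test = rng.integers(0, 2, size=n_test)

val_acc, test_acc = [], []
for _ in range(n_candidates):
    cand = rng.integers(0, 2, size=n_val + n_test)   # one "hyperparameter set"
    val_acc.append(np.mean(cand[:n_val] == y_val))
    test_acc.append(np.mean(cand[n_val:] == y_test))

best = int(np.argmax(val_acc))
print("winner's validation accuracy:", val_acc[best])   # typically well above 0.5
print("winner's test accuracy:      ", test_acc[best])  # close to 0.5, as it must be
```

The smaller the validation set and the more candidates you compare, the larger the optimistic gap, which is exactly the multiple-comparison argument above.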