Cross Validation – Should a Training Set Be Used for Grid Search with Cross Validation?

cross-validation, hyperparameter, optimization, scikit-learn

I'm looking at an example of using grid search in sklearn, and noticed that after doing a train-test split, the author performs the grid search using only the training data.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
...

# `score` is defined in the elided part of the example (a scoring-metric name)
clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                   scoring='%s_macro' % score)
clf.fit(X_train, y_train)

sklearn's GridSearchCV performs k-fold cross-validation as part of the grid search. Given that, wouldn't we want to use the entire dataset, since the CV performs its own validation splits? Is there a concept I'm not understanding here?
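
(For concreteness, my understanding is that the internal cross-validation amounts to roughly the following sketch; the grid here is just a placeholder, and X_train/y_train come from the split above.)

import numpy as np
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

# Placeholder grid for illustration
tuned_parameters = {'C': [1, 10, 100], 'kernel': ['linear', 'rbf']}

# Roughly what GridSearchCV does internally: for each parameter
# combination, run k-fold CV on the data passed to fit() and keep
# the combination with the best mean validation score.
best_score, best_params = -np.inf, None
for params in ParameterGrid(tuned_parameters):
    scores = cross_val_score(SVC(**params), X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_score, best_params = scores.mean(), params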

Best Answer

It is recommended to hold out a test set that the model only sees at the end, not during the parameter tuning and model selection steps.
Grid search with cross-validation is especially useful for performing these steps, which is why the author only uses the training data.

If you use your whole dataset for this step, you will have picked the model and parameter set that work best on the whole data, including the test set. The test set then no longer gives an honest estimate of performance on unseen data, so this is prone to overfitting.
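
As a minimal sketch of the intended workflow (assuming the same X_train/X_test split and tuned_parameters grid as in the question), the test half is touched exactly once, after the tuning is done:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# All tuning happens on the training half; the CV splits come from it alone
clf = GridSearchCV(SVC(), tuned_parameters, cv=5)
clf.fit(X_train, y_train)

print(clf.best_params_)           # chosen using only the internal CV splits
print(clf.score(X_test, y_test))  # the held-out test set is used only here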

Usually it is recommended to either:

  • Split your dataset into three parts: train, validation, and test. Then perform the model selection and hyperparameter search, each time training on the train set and checking the score on the validation set (see the sketch after this list).
  • Split into two parts: train and test, and then perform cross-validation on the train set to do the model selection and hyperparameter search. This time you don't have one validation set but as many validation folds as your CV uses, so this is more robust (if your model does not take too long to train). This is what GridSearchCV does in the question's code.
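
A rough sketch of the first option, with a hypothetical 60/20/20 split and a plain SVC; the exact proportions and grid are just examples:

from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.svm import SVC

# First option: one explicit validation set instead of CV folds
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_score, best_params = -1.0, None
for params in ParameterGrid({'C': [1, 10, 100], 'kernel': ['linear', 'rbf']}):
    model = SVC(**params).fit(X_train, y_train)
    score = model.score(X_val, y_val)       # the validation set guides the choice
    if score > best_score:
        best_score, best_params = score, params

final_model = SVC(**best_params).fit(X_train, y_train)
print(final_model.score(X_test, y_test))    # the test set is used only once, at the end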