Solved – GridSearchCV returns better parameters with cv=5 than with cv=10

cross-validation, machine-learning

I use the Titanic dataset in this project.

First, I split the data into training and test sets:

training, testing = train_test_split(train, test_size=0.2, stratify=train['Survived'], random_state=0)

I try to find the best parameters for my decision tree using GridSearchCV:

parameters = {
                'criterion': ['gini', 'entropy'], # according to the book, these are roughly equivalent
                'min_samples_leaf': [1, 10, 20],
                'max_depth': [3, 6, 9, 12],
                'class_weight': [None, 'balanced'],
                'max_features': [None, 'sqrt', 'log2'],
                'presort': [True],
                'random_state': [0]
             }

clf = tree.DecisionTreeClassifier()

grid_obj = GridSearchCV(clf, parameters, cv=5, scoring='accuracy')
grid_obj = grid_obj.fit(X, y)

The above code, with cv=5, gives me 80.4% accuracy on the test set, with a best_score_ of 81.8%.

Then I read that the best k-fold CV for most situations is 10, so I changed cv to 10:

grid_obj = GridSearchCV(clf, parameters, cv=10, scoring='accuracy')

It gives me new best_estimator_ parameters, but the result is worse: only 75.9% accuracy on the test set, with a best_score_ of 82.8%.

I'm wondering why this happened. The best_score_ returned by GridSearchCV is almost the same, but the test set accuracy decreased by 4.5 percentage points.

Should I try many different values of cv each time I train a model?

I understand the basic process of k-fold CV, but I don't really understand what happened here.

Best Answer

Most probably you are just seeing the uncertainty in your performance estimates.

The cross-validation estimates are subject to (at least):

  • some pessimistic bias due to not training on the full data set (I'd expect negligible differences between 5-fold and 10-fold CV)
  • possibly optimistic bias if the splitting leaves dependence between training and test sets (can be large)
  • variance due to instability of the surrogate models (can be measured from repeated/iterated CV results - typically trees suffer from this)
  • variance due to the limited number of tested cases, in your case:
    • 95 % confidence interval for 80 % observed accuracy with a test set of size 262: 75 - 85 % (see the quick check after this list)
  • (variance due to lack of representativeness of the data at hand: this does not play a role here as you are interested only in the one existing titanic data)
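As a quick sanity check on that interval, the normal-approximation confidence interval for a proportion can be computed directly. A minimal sketch (1.96 is the usual z-value for 95 % coverage; an exact binomial interval would give very similar numbers here):

import numpy as np

def accuracy_ci(acc, n, z=1.96):
    """Normal-approximation 95 % confidence interval for an observed accuracy on n test cases."""
    half_width = z * np.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

print(accuracy_ci(0.80, 262))   # roughly (0.75, 0.85), the interval above
print(accuracy_ci(0.80, 1047))  # roughly (0.78, 0.82), the range quoted below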

Now the optimizer has 80 % of the data set, i.e. 1047 cases. If it observes 80 % accuracy, that estimate has roughly a 77 - 82 % range for a 95 % confidence interval, based on the finite number of tested cases alone. In other words, there may be a number of models in the optimization that the optimizer cannot really distinguish. It will pick the parameter set that appears to be best, but grid search does not guard against the variance sources discussed above. This causes further instability, i.e. variance, in the final model: the hyperparameter choice itself is unstable (see also Cawley & Talbot's paper on overfitting in model selection).
You can check this by running GridSearchCV several times with different CV splits and checking the distribution of observed test set accuracy (and also the returned hyperparameters).
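Here is a minimal sketch of that check, reusing the parameters grid and the training/testing split from the question. The feature columns (here called feature_cols) and the label column are assumptions about the preprocessing, since the question does not show how X and y were built:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn import tree

# Assumed preprocessing: feature_cols is a hypothetical list of the feature columns used above
X_train, y_train = training[feature_cols], training['Survived']
X_test, y_test = testing[feature_cols], testing['Survived']

for seed in range(10):
    # new CV split each run, everything else fixed
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    grid = GridSearchCV(tree.DecisionTreeClassifier(), parameters, cv=cv, scoring='accuracy')
    grid.fit(X_train, y_train)
    print(seed,
          round(grid.best_score_, 3),            # inner CV estimate
          round(grid.score(X_test, y_test), 3),  # held-out test accuracy
          grid.best_params_)                     # chosen hyperparameters

If the test-set accuracies (and the chosen hyperparameters) vary by several percentage points across seeds, then the difference you saw between cv=5 and cv=10 is within that noise.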

In addition, if the two grid searches were run with new splits into training and test sets, the variance from measuring accuracy on only 262 test cases can already explain the observed difference on its own. Or, statistically speaking, you cannot reject the null hypothesis that both models have equal performance.
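One way to formalise that last statement is McNemar's test on the paired predictions of the two tuned models over the same 262 test cases. A hypothetical sketch, assuming clf_cv5 and clf_cv10 are the refitted best estimators from the cv=5 and cv=10 searches and X_test, y_test are the held-out split:

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

correct5 = clf_cv5.predict(X_test) == y_test    # per-case correctness of the cv=5 winner
correct10 = clf_cv10.predict(X_test) == y_test  # per-case correctness of the cv=10 winner

# 2x2 table counting agreement/disagreement in correctness
table = np.array([
    [np.sum(correct5 & correct10), np.sum(correct5 & ~correct10)],
    [np.sum(~correct5 & correct10), np.sum(~correct5 & ~correct10)],
])
print(mcnemar(table, exact=True))  # a large p-value means no detectable difference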