I'm using the Titanic dataset in this project.
First, I split the data into train and test sets:
from sklearn.model_selection import train_test_split

training, testing = train_test_split(train, test_size=0.2, stratify=train['Survived'], random_state=0)
I try to find the best parameters for my decision tree using GridSearchCV:
parameters = {
'criterion': ['gini', 'entropy'],  # the book says these are roughly the same
'min_samples_leaf': [1, 10, 20],
'max_depth': [3, 6, 9, 12],
'class_weight': [None, 'balanced'],
'max_features': [None, 'sqrt', 'log2'],
'presort': [True],
'random_state': [0]
}
from sklearn import tree
from sklearn.model_selection import GridSearchCV

clf = tree.DecisionTreeClassifier()
grid_obj = GridSearchCV(clf, parameters, cv=5, scoring='accuracy')
grid_obj = grid_obj.fit(X, y)
The above code, with cv=5, gives me 80.4% accuracy on the test set, with a best_score_ of 81.8%.
Then I read that 10-fold CV is best for most situations, so I changed cv to 10:
grid_obj = GridSearchCV(clf, parameters, cv=10, scoring='accuracy')
It gives me a new best_estimator_, but the result is worse: only 75.9% accuracy on the test set, with a best_score_ of 82.8%.
I'm wondering, why did this happen? The best_score_ returned by GridSearchCV is almost the same, but the test-set accuracy decreased by 4.5 percentage points. Should I try many different cv parameters each time I train a model?
I understand the basic process of k-fold CV, but I don't really understand what happened here.
Best Answer
Most probably you are just seeing the uncertainty of your performance estimates.
Cross-validation estimates are subject to (at least) two sources of variance: the finite number of tested cases, and the instability of the surrogate models trained on the different CV splits.
Now the optimizer has 80 % of the data set, i.e. 1047 cases. If it observes 80 % accuracy, this estimate has roughly a 77 – 82 % range for a 95 % confidence interval, based on the finite number of tested cases alone. In other words, there may be a number of models in the optimization that the optimizer cannot really distinguish. It will pick the parameter set that appears best, but grid search does not guard against the variance sources discussed above. This causes further instability (variance) in the final model: the hyperparameter choice itself becomes unstable (see also Cawley & Talbot's paper).
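The confidence interval quoted above can be verified with a quick normal-approximation calculation for a binomial proportion (a sketch; the 80 % accuracy and 1047-case figures are taken from the answer):

```python
import math

# Normal-approximation 95% CI for an observed accuracy of 0.80
# estimated from n = 1047 test cases.
p, n = 0.80, 1047
se = math.sqrt(p * (1 - p) / n)          # standard error of the proportion
lo, hi = p - 1.96 * se, p + 1.96 * se    # 95% confidence bounds
print(f"95% CI: {lo:.3f} - {hi:.3f}")    # roughly 0.776 - 0.824
```

So two candidate models whose true accuracies differ by a couple of percentage points may be indistinguishable on 1047 test cases.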
You can check this by running GridSearchCV several times with different CV splits and looking at the distribution of observed test-set accuracies (and also the returned hyperparameters). In addition, if the runs used new splits into train and test set, the variance due to testing accuracy with only 262 cases can already explain the observed difference. Statistically speaking, you cannot reject the null hypothesis that both models have equal performance.
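A minimal sketch of that check, assuming a synthetic dataset as a stand-in for the actual Titanic features and an abbreviated parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the real Titanic features/labels.
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

parameters = {'max_depth': [3, 6, 9, 12], 'min_samples_leaf': [1, 10, 20]}

scores, chosen = [], []
for seed in range(5):
    # A fresh CV split each run: the spread of results reveals the instability.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                        parameters, cv=cv, scoring='accuracy')
    grid.fit(X_tr, y_tr)
    scores.append(grid.score(X_te, y_te))  # test-set accuracy for this run
    chosen.append(grid.best_params_)       # selected hyperparameters

print(scores)   # accuracies vary from run to run
print(chosen)   # the chosen hyperparameters may vary too
```

If the test-set accuracies of the different runs spread over a few percentage points, the 80.4 % vs. 75.9 % difference you observed is well within that noise.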