I have been using GridSearchCV to tune the hyperparameters of three different models. Through hyperparameter tuning I have gotten AUCs of 0.65 (Model A), 0.74 (Model B), and 0.77 (Model C).
However, when I return the `best_score_` for each grid search, I get scores of 0.72 (Model A), 0.68 (Model B), and 0.71 (Model C).
I am confused about why these scores are noticeably different; for example, Model A has the weakest AUC but the strongest `best_score_`. Is this OK? Does this mean that more tuning likely needs to be done?
Thanks!
Best Answer
There are two main issues here, in my mind.
1. **You're comparing accuracy and AUROC.** The default `scoring` in `GridSearchCV` uses the model object's `score` method, which is accuracy for classification models like `RandomForestClassifier`. There's no guarantee that two metrics agree on the best model, and accuracy isn't a great metric. One specific possibility here is that Model A does a poor job at rank-ordering compared to the others, but the others perform poorly at the default classification cutoff of 0.5 used by the accuracy metric.
2. **You're comparing test-set performance with hyperparameter-selection scores.** The `best_score_` is optimistically biased because of the selection process. If one of the selections resulted in a more-overfit model, it might show a larger drop from `best_score_` to test score than the others.
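A minimal sketch of both points, using a synthetic dataset and an illustrative parameter grid (the data and grid are assumptions, not the asker's setup): passing `scoring="roc_auc"` makes `best_score_` an AUROC rather than accuracy, and comparing it against a held-out test AUROC shows the two numbers need not match.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Point 1: set scoring explicitly so best_score_ is AUROC.
# Without this, GridSearchCV falls back to the estimator's .score(),
# which is accuracy for classifiers.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 5, None]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)

# Point 2: best_score_ is the cross-validation score of the selected
# candidate and is optimistically biased; the held-out test AUROC is
# the fairer number to compare across models.
test_auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print(f"best_score_ (CV AUROC): {grid.best_score_:.3f}")
print(f"held-out test AUROC:    {test_auc:.3f}")
```

Note that the CV score and the test AUROC can disagree in either direction for any single model; what matters is comparing all models on the same metric and the same held-out data.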