AUC Score – Understanding the Disconnect with GridSearchCV’s Best Score

machine learning, python

I have been using GridSearchCV to tune the hyperparameters of three different models. Through hyperparameter tuning I have gotten AUCs of 0.65 (Model A), 0.74 (Model B), and 0.77 (Model C).

However, when I return the best_score_ for each grid search, I get scores of 0.72 (Model A), 0.68 (Model B), and 0.71 (Model C).

I am confused about why these scores are noticeably different; for example, Model A has the weakest AUC but the strongest best_score_. Is this OK? Does this mean that more tuning likely needs to be done?

Thanks!

Best Answer

There are two main issues here, in my mind.

  1. You're comparing accuracy and AUROC. The default scoring in GridSearchCV uses the model object's score method, which is accuracy for classifiers like RandomForestClassifier, so best_score_ and your held-out AUCs are not measuring the same thing. There's no guarantee that two metrics agree on which model is best, and accuracy isn't a great metric anyway. One specific possibility here is that Model A does a poor job of rank-ordering compared to the others, while the others perform poorly at the default classification cutoff of 0.5 used for accuracy. Passing scoring="roc_auc" puts both numbers on the same scale (see the first sketch after this list).

  2. You're comparing test-set performance with hyperparameter-selection scores. The best_score_ is optimistically biased because the same cross-validation folds are used to pick the hyperparameters. If one of the searches selected a more-overfit model, its test score may drop further below its best_score_ than the others'. Nested cross-validation is one way to get a less biased estimate of the whole tuning procedure (see the second sketch below).
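
As a minimal sketch of point 1, you can ask GridSearchCV to optimize AUROC directly via scoring="roc_auc", so that best_score_ and your held-out AUC are at least measured with the same metric. The classifier, parameter grid, and synthetic data below are placeholders, not taken from your setup:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Placeholder data and grid; substitute your own.
    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5]}

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        scoring="roc_auc",  # the default would be the classifier's accuracy
        cv=5,
    )
    grid.fit(X_train, y_train)

    # Cross-validated AUROC of the best hyperparameter setting
    print("best_score_ (CV AUROC):", grid.best_score_)

    # Test-set AUROC of the refit best model, on the same metric
    test_auc = roc_auc_score(y_test, grid.best_estimator_.predict_proba(X_test)[:, 1])
    print("test AUROC:", test_auc)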
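
And for point 2, a sketch of nested cross-validation, which scores the whole tuning procedure rather than reporting the optimistically biased best_score_. Again, the data and grid are placeholders:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = make_classification(n_samples=2000, random_state=0)

    # Inner loop: hyperparameter search scored on AUROC
    inner_search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": [100, 300], "max_depth": [None, 5]},
        scoring="roc_auc",
        cv=5,
    )

    # Outer loop: each fold re-runs the inner search, so the score reflects
    # the full selection procedure, not just the finally selected model.
    outer_scores = cross_val_score(inner_search, X, y, scoring="roc_auc", cv=5)
    print("nested CV AUROC: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))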