AUC Score Comparison – Why Validation AUC Can Be Higher Than Test AUC

cross-validation, machine-learning, python

I am creating a RandomForestClassifier model that uses biomarker measurements and clinical measurements to predict a disease (binary). There are an equal number of people who do and do not develop this disease. I have been using GridSearchCV to tune the hyperparameters with cv=5. Through this cross-validation the best AUC score is 0.79. When I apply the best parameters to the testing set, the AUC score is 0.66.

What does this mean? Does it mean the model is overfitting? If so, how can I fix that?

Thanks!

Best Answer

You are correct that this suggests over-fitting. (By over-fitting, in this case we mean that the final learner $f_{\text{final}}$ produced by our model selection procedure is unable to generalise to out-of-sample data.)

While there is no direct information on the sample sizes involved, there are some obvious things to consider:

  1. Do a quick EDA on the training sample to ensure that the feature distributions in the training and testing sets are similar. Covariate shift might be an issue. (A quick check is sketched after this list.)
  2. Perform a variable-importance investigation for $f_{\text{final}}$ and focus specifically on the most important features. Are they sensible? Our $f_{\text{final}}$ might be fitting on "noisy" features that carry information that is not available in our testing set. Strobl & Zeileis have a great presentation on "Why and how to use random forest variable importance measures (and how you shouldn't)". (A permutation-importance sketch follows the list.)
  3. Is $f_{\text{final}}$ actually what we think it is? Sometimes, even when we cross-validate correctly, we end up training our final learner with the wrong parameters. (See the sketch after this list.)
  4. Try a different learner from a slightly different family (e.g. gradient boosting or Extremely Randomised Trees) or a totally different type (e.g. a neural network). Does it achieve better agreement between training and testing performance? Maybe it is more robust to some of the latent characteristics of the dataset that cause this over-fitting.
  5. Consider enforcing a more aggressive early-stopping and/or regularisation approach. This might make the resulting model more biased but potentially less variable. In particular, for an RF, try something with a relatively small number of relatively shallow trees (e.g. 20 trees of depth 4). Does this learner still show a very noticeable discrepancy between testing and training performance? If it does, revisit points 1 & 2 to check further. (A sketch combining this with point 6 follows the list.)
  6. Consider looking at the variance across CV folds. While the best average AUC score might be ~79%, how did it come about? Is it the average of per-fold AUCs of 52-53-96-97-98 or of 77-78-79-80-81? The latter suggests much more stable performance, while the former is indicative of huge (almost nonsensical) variation.
  7. People don't usually say this first, but think about whether we can get more data, either straight from the data-generating procedure (ideal) or via data-augmentation techniques (if reasonable). Some learning methodologies (random forests indeed being one of them) are more data-hungry than others (e.g. logistic regression), so to get generalisable performance more data might be very helpful.
  8. As mentioned in the preamble, there might be some sampling variation involved. If the sample is small, using the bootstrap instead of CV might be preferable when assessing model performance. (Also, I would use repeated CV instead of a single CV run unless really computationally constrained; repeated CV is much more robust against something like point 6, as sketched below.) CV.SE already has some great threads on the topic; I would suggest looking at: Internal vs external cross-validation and model selection and Cross validation with test data set to start with.
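Below are a few minimal sketches of some of these checks. They all use a synthetic dataset from `make_classification` as a stand-in for the real biomarker data, so the variable names and numbers are purely illustrative and not part of the original analysis.

For point 1, a crude covariate-shift check is to compare each feature's marginal distribution in the training and testing splits, e.g. with a two-sample Kolmogorov-Smirnov test:

```python
# Point 1: flag features whose train/test marginal distributions differ.
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real biomarker/clinical data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for j in range(X.shape[1]):
    res = ks_2samp(X_train[:, j], X_test[:, j])
    flag = "  <-- possible shift" if res.pvalue < 0.01 else ""
    print(f"feature {j}: KS={res.statistic:.3f}, p={res.pvalue:.3f}{flag}")
```

With a random split like this, no feature should be flagged; on real data a small p-value (after accounting for multiple testing) is worth a closer look.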
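For point 2, permutation importance computed on held-out data is usually a safer guide than the forest's impurity-based importances (this is exactly the kind of pitfall Strobl & Zeileis discuss). A minimal sketch, again on synthetic stand-in data:

```python
# Point 2: permutation importance of a fitted random forest on the test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
imp = permutation_importance(rf, X_test, y_test, scoring="roc_auc",
                             n_repeats=20, random_state=0)

# Print features from most to least important, with the spread over repeats.
for j in imp.importances_mean.argsort()[::-1]:
    print(f"feature {j}: {imp.importances_mean[j]:.3f} +/- {imp.importances_std[j]:.3f}")
```

If the top features make no clinical sense, that is a strong hint the model is latching onto noise.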
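For point 3, it is worth double-checking that the model scored on the test set really is the refitted best estimator from GridSearchCV, rather than one rebuilt by hand with (possibly mistyped) parameters:

```python
# Point 3: verify the parameters of the refitted best estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"max_depth": [3, 5, None], "n_estimators": [100, 300]},
                    scoring="roc_auc", cv=5, refit=True)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("refitted estimator params:", grid.best_estimator_.get_params())
# Score the test set with grid.best_estimator_ (or grid.predict_proba) directly.
```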
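For points 5 and 6, a deliberately small and shallow forest, combined with a look at the individual per-fold AUCs rather than only their mean, is quick to run:

```python
# Points 5 & 6: shallow forest + per-fold AUC spread.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

shallow_rf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = cross_val_score(shallow_rf, X, y, scoring="roc_auc", cv=cv)

print("per-fold AUCs:", np.round(fold_aucs, 3))
print(f"mean {fold_aucs.mean():.3f}, std {fold_aucs.std():.3f}")
```

A large standard deviation across folds is the "52-53-96-97-98" situation from point 6.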
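For point 8, repeated stratified CV is a drop-in replacement for a single 5-fold split and gives a much more stable estimate of the selection score:

```python
# Point 8: repeated stratified CV instead of a single 5-fold split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring="roc_auc", cv=cv)
print(f"AUC over {len(scores)} resamples: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The same `cv` object can be passed to GridSearchCV so that the hyperparameter search itself benefits from the repetition.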

Obviously this is not an exhaustive list, but we can use it to do a quick investigation into the causes of the non-generalisable performance we observed.