AUC Score Comparison – Why Validation AUC Can Be Higher Than Test AUC

cross-validation, machine-learning, python

I am creating a RandomForestClassifier model that uses biomarker measurements and clinical measurements to predict a disease (binary). There are an equal number of people who do and do not develop this disease. I have been using GridSearchCV to tune the hyperparameters with cv=5. Through this cross-validation the best AUC score is 0.79. When I apply the best parameters to the testing set, the AUC score is 0.66.

What does this mean? Does it mean the model is overfitting? If so, how can I fix that?

Thanks!

Best Answer

You are correct that this suggests over-fitting. (By over-fitting, in this case we mean that the final learner $f_{\text{final}}$ produced by our model selection procedure is unable to generalise to out-of-sample data.)

While there is no direct information on the sample sizes involved, there are some obvious things to consider:

  1. Do a quick EDA on the training sample to ensure that the feature distributions in the training and testing sets are similar. Covariate shift might be an issue. (A quick check is sketched after this list.)
  2. Perform a variable-importance investigation for $f_{\text{final}}$ and focus specifically on the most important features. Are they sensible? Our $f_{\text{final}}$ might be fitting on "noisy" features that carry information that is not available in our testing set. Strobl & Zeileis have a great presentation on "Why and how to use random forest variable importance measures (and how you shouldn't)". (A permutation-importance sketch follows the list.)
  3. Is $f_{\text{final}}$ actually what we think it is? Sometimes, even when we cross-validate correctly, we end up training our final learner with the wrong parameters. (See the sketch after this list.)
  4. Try a different learner from a slightly different family (e.g. gradient boosting or Extremely Randomised Trees) or a totally different type (e.g. a neural network). Does it achieve better agreement between training and testing performance? Maybe it is more robust to some of the latent characteristics of the dataset that cause this over-fitting.
  5. Consider enforcing a more aggressive early-stopping and/or regularisation approach. This might make the resulting model more biased but potentially less variable. In particular, for an RF, try something with a relatively small number of relatively shallow trees (e.g. 20 trees of depth 4). Does this learner still show a very noticeable discrepancy between testing and training performance? If it does, revisit points 1 & 2 to check further. (A sketch combining this with point 6 follows the list.)
  6. Consider looking at the variance across CV folds. While the best average AUC score might be ~79%, how did it come about? Is it the average of per-fold AUCs of 52-53-96-97-98 or of 77-78-79-80-81? The latter suggests much more stable performance, while the former is indicative of huge (almost nonsensical) variation.
  7. People don't usually say this first, but think about whether we can get more data, either straight from the data-generating procedure (ideal) or via data-augmentation techniques (if reasonable). Some learning methodologies (random forests indeed being one of them) are more data-hungry than others (e.g. logistic regression), so to get generalisable performance more data might be very helpful.
  8. As mentioned in the preamble, there might be some sampling variation involved. If the sample is small, using the bootstrap instead of CV might be preferable when assessing model performance. (Also, I would use repeated CV instead of a single CV run unless really computationally constrained; repeated CV is much more robust against something like point 6, as sketched below.) CV.SE already has some great threads on the topic; I would suggest looking at: Internal vs external cross-validation and model selection and Cross validation with test data set to start with.
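Below are a few minimal sketches of some of these checks. They all use a synthetic dataset from `make_classification` as a stand-in for the real biomarker data, so the variable names and numbers are purely illustrative and not part of the original analysis.

For point 1, a crude covariate-shift check is to compare each feature's marginal distribution in the training and testing splits, e.g. with a two-sample Kolmogorov-Smirnov test:

```python
# Point 1: flag features whose train/test marginal distributions differ.
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real biomarker/clinical data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for j in range(X.shape[1]):
    res = ks_2samp(X_train[:, j], X_test[:, j])
    flag = "  <-- possible shift" if res.pvalue < 0.01 else ""
    print(f"feature {j}: KS={res.statistic:.3f}, p={res.pvalue:.3f}{flag}")
```

With a random split like this, no feature should be flagged; on real data a small p-value (after accounting for multiple testing) is worth a closer look.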
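For point 2, permutation importance computed on held-out data is usually a safer guide than the forest's impurity-based importances (this is exactly the kind of pitfall Strobl & Zeileis discuss). A minimal sketch, again on synthetic stand-in data:

```python
# Point 2: permutation importance of a fitted random forest on the test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
imp = permutation_importance(rf, X_test, y_test, scoring="roc_auc",
                             n_repeats=20, random_state=0)

# Print features from most to least important, with the spread over repeats.
for j in imp.importances_mean.argsort()[::-1]:
    print(f"feature {j}: {imp.importances_mean[j]:.3f} +/- {imp.importances_std[j]:.3f}")
```

If the top features make no clinical sense, that is a strong hint the model is latching onto noise.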
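For point 3, it is worth double-checking that the model scored on the test set really is the refitted best estimator from GridSearchCV, rather than one rebuilt by hand with (possibly mistyped) parameters:

```python
# Point 3: verify the parameters of the refitted best estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"max_depth": [3, 5, None], "n_estimators": [100, 300]},
                    scoring="roc_auc", cv=5, refit=True)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("refitted estimator params:", grid.best_estimator_.get_params())
# Score the test set with grid.best_estimator_ (or grid.predict_proba) directly.
```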
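For points 5 and 6, a deliberately small and shallow forest, combined with a look at the individual per-fold AUCs rather than only their mean, is quick to run:

```python
# Points 5 & 6: shallow forest + per-fold AUC spread.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

shallow_rf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = cross_val_score(shallow_rf, X, y, scoring="roc_auc", cv=cv)

print("per-fold AUCs:", np.round(fold_aucs, 3))
print(f"mean {fold_aucs.mean():.3f}, std {fold_aucs.std():.3f}")
```

A large standard deviation across folds is the "52-53-96-97-98" situation from point 6.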
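For point 8, repeated stratified CV is a drop-in replacement for a single 5-fold split and gives a much more stable estimate of the selection score:

```python
# Point 8: repeated stratified CV instead of a single 5-fold split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring="roc_auc", cv=cv)
print(f"AUC over {len(scores)} resamples: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The same `cv` object can be passed to GridSearchCV so that the hyperparameter search itself benefits from the repetition.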

Obviously this is not an exhaustive list, but we can use it to do a quick investigation into the causes of the non-generalisable performance we observed.