Solved – Reason for higher AUC from a test set than a training set using a random forest

auc, machine learning, random forest, roc

I made a 70:30 split of the data to build a random forest model for binary classification. Although the prevalence of $Y=1$ was about 25% in both the training and test sets, missingness in the covariates meant that only complete cases could be used for fitting the model and making predictions, and the class balance of the two sets drifted apart as a result: the "complete" training set had only about half as many $Y=1$ cases as the "complete" test set.
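
For concreteness, a rough sketch of what I mean by the "complete" sets (synthetic data and placeholder column names, not my actual variables):

```python
# Sketch: a 70:30 split followed by complete-case filtering, then a check of
# how many Y = 1 cases survive in each set. The synthetic DataFrame and the
# column names ("x1", "x2", "y") are placeholders, not the data from the post.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "y": rng.binomial(1, 0.25, size=n),   # ~25% prevalence of Y = 1
})
# Introduce missingness in one covariate
df.loc[rng.random(n) < 0.2, "x1"] = np.nan

train, test = train_test_split(df, test_size=0.30, random_state=0)

# Only complete cases (no missing covariates) are used for fitting / prediction
train_cc, test_cc = train.dropna(), test.dropna()

for name, part in [("train", train_cc), ("test", test_cc)]:
    print(f"{name}: n = {len(part)}, Y=1 cases = {int(part['y'].sum())}, "
          f"prevalence = {part['y'].mean():.3f}")
```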

The AUC for the training data was about 0.70 and the AUC for the test data was about 0.85.

How should I explain this? I thought the training data would always show a higher AUC than the test data, since the model was built on the training data.

Best Answer

This can easily be attributed to random variation. While in-sample performance is indeed expected to be better than out-of-sample performance (i.e. the training error is usually smaller than the test error), that is not a necessity: the AUC calculated here is a statistic, a function of the particular sample at hand, and is therefore subject to sampling variability.

It would be reasonable to use multiple training/test splits (or to bootstrap the sample at hand) so that you can quantify the variability of that statistic; repeated cross-validation and bootstrapping are standard approaches for estimating the sampling distribution of a statistic of interest. There is a very informative thread on CV, Hold-out validation vs. cross-validation, that I think will help clarify things even further.
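
As a rough illustration of that suggestion, here is a minimal sketch of repeating the 70:30 split many times and looking at the spread of the test AUC. The data, model settings, and split counts are illustrative only, not taken from the question:

```python
# Sketch: estimate the sampling variability of the test-set AUC by repeating
# the 70:30 split many times on synthetic data with ~25% prevalence of y = 1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

X, y = make_classification(n_samples=800, n_features=20,
                           weights=[0.75, 0.25], random_state=0)

splitter = StratifiedShuffleSplit(n_splits=50, test_size=0.30, random_state=0)
test_aucs = []

for train_idx, test_idx in splitter.split(X, y):
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    prob = rf.predict_proba(X[test_idx])[:, 1]
    test_aucs.append(roc_auc_score(y[test_idx], prob))

test_aucs = np.array(test_aucs)
print(f"test AUC: mean = {test_aucs.mean():.3f}, sd = {test_aucs.std():.3f}, "
      f"2.5%-97.5% range = {np.percentile(test_aucs, [2.5, 97.5]).round(3)}")
```

The spread of those 50 values gives a direct sense of how much a single observed gap (such as 0.70 vs 0.85) could simply reflect which observations happened to land in each split.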