Solved – Random Forest has almost perfect training AUC compared to other models

boosting, generalized linear model, machine learning, random forest, scikit-learn

I'm working on a two-class classification problem with a very unbalanced class distribution (95% vs. 5%). The overall dataset has 500k+ rows and I did a 70%/30% train-test split. So far I have tried the following models (all sklearn; a rough sketch of the setup follows the list):

  1. Logistic regression: train AUC ~0.5, test AUC ~0.5
  2. Gradient boosting: train AUC ~0.74, test AUC ~0.69
  3. Random Forest: train AUC 0.9999999, test AUC ~0.80
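
For reference, here is a minimal sketch of this kind of setup. The synthetic 95%/5% data and the default hyperparameters are only placeholders, not my actual data and settings:

    # Minimal sketch: synthetic 95%/5% data and default hyperparameters
    # stand in for the real dataset and settings.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=50_000, n_features=20,
                               weights=[0.95, 0.05], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=0)

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "gradient boosting": GradientBoostingClassifier(),
        "random forest": RandomForestClassifier(),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        # AUC is computed from predicted probabilities, not hard class labels
        train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
        test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"{name}: train AUC = {train_auc:.4f}, test AUC = {test_auc:.4f}")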

I'm seeing a near-perfect training AUC for random forest but only ~0.8 on the test set. The numbers in #1 and #2 look much more normal to me, but the "perfect" AUC on the random forest training set really worries me.

Is this something I should expect, or is it within the normal range? Why is this happening to random forest but not to some of the other classifiers? Is there a reasonable explanation or guess for this?


Update: I have done 10-fold CV and a parameter grid search on the random forest model, and here are some results (a sketch of this step follows the list):

  1. Random Forest (original): train AUC 0.9999999, test AUC ~0.80
  2. Random Forest (10-fold cv): average test AUC ~0.80
  3. Random Forest (grid search, max depth 12): train AUC ~0.73, test AUC ~0.70
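
Roughly what that step looks like, continuing from the sketch above (X_train and y_train as defined there; the parameter grid is only an illustration):

    # Sketch of the 10-fold CV and grid-search step, reusing X_train, y_train
    # from the earlier sketch. The grid is illustrative; only max_depth is
    # shown because that is the setting reported above.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score

    # 10-fold CV estimate for the original (unconstrained) forest
    cv_auc = cross_val_score(RandomForestClassifier(random_state=0),
                             X_train, y_train, cv=10, scoring="roc_auc")
    print("10-fold CV AUC: %.3f" % cv_auc.mean())

    # Grid search over tree depth, scored by AUC
    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid={"max_depth": [4, 8, 12, 16, None]},
                        scoring="roc_auc", cv=10)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)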

I can see that with the optimal parameter settings from the grid search, the train and test AUCs are no longer that different and look normal to me. However, this test AUC of ~0.70 is much worse than the test AUC of the original random forest (~0.80).

If this were an overfitting problem, the test AUC should increase after regularization, but I'm seeing the opposite, which confuses me.

Is there anything I'm missing here? Why is this happening? If I were to choose between the two models, I would pick the one with the higher test AUC, which is the "probably" overfitted random forest. Does that make sense?

Best Answer

Because ML algorithms work by minimizing the error on the training data, the expected accuracy on that data will "naturally" be better than your test results. In fact, when the training error is suspiciously low (i.e. the accuracy suspiciously high), something may well have gone wrong (i.e. overfitting).

As suggested by user5957401, you can cross-validate the training process. For example, if you have a good number of instances, 10-fold cross-validation would be fine. If you also need to tune hyperparameters, nested cross-validation would be necessary.
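
For instance, something along these lines (just a sketch: the fold counts and the parameter grid are arbitrary examples, and make_classification stands in for your data):

    # Sketch of nested cross-validation: the inner loop tunes max_depth, the
    # outer loop estimates the AUC of the whole tuning procedure on held-out
    # folds. X, y are a synthetic stand-in for the real dataset.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=50_000, n_features=20,
                               weights=[0.95, 0.05], random_state=0)

    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

    tuned_rf = GridSearchCV(RandomForestClassifier(random_state=0),
                            param_grid={"max_depth": [4, 8, 12, None]},
                            scoring="roc_auc", cv=inner_cv)

    # Each outer fold re-runs the grid search on its training part and scores
    # the selected model on its held-out part.
    nested_auc = cross_val_score(tuned_rf, X, y, scoring="roc_auc", cv=outer_cv)
    print("Nested CV AUC: %.3f +/- %.3f" % (nested_auc.mean(), nested_auc.std()))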

This way, the error estimated on the test set will be close to the expected one (i.e. the error you will get on real data), and you can check whether your result (AUC 0.80 on the test set) is a reliable estimate or whether you got it by chance.

You can also try other techniques, such as shuffling your data several times before running cross-validation, to make the resulting estimate more reliable.
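
In scikit-learn, one way to do this is RepeatedStratifiedKFold, which repeats the stratified split with a different shuffle each time (again only a sketch, reusing the X, y stand-in from above):

    # Sketch: repeat stratified 10-fold CV five times with different shuffles
    # to get a more stable AUC estimate (X, y reused from the sketch above).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
    scores = cross_val_score(RandomForestClassifier(random_state=0),
                             X, y, scoring="roc_auc", cv=cv)
    print("AUC over %d runs: %.3f +/- %.3f"
          % (len(scores), scores.mean(), scores.std()))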