Solved – Cross Validation and test ROC AUC scores match but train score doesn’t

Tags: classification, cross-validation, random-forest

I have a dataset of about 49K rows and 31 columns. I ran a grid search with 3-fold CV to tune the hyperparameters of a Random Forest (using scikit-learn). I then fit the best estimator on the train set and predicted on the test set. The ROC AUC scores are as follows:

CV: 0.705
Train: 0.836
Test: 0.721

Can this be considered overfitting? If so, what measures can I take to remedy it? So far I have been searching over n_estimators and max_depth. The search always selects the largest depth available, and the gap between the train score and the CV/test scores keeps growing. I apply class weights to balance the dataset.
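For reference, here is a minimal sketch of the setup described above, run on a synthetic dataset of matching shape. The parameter grid, split ratio, and class imbalance are illustrative assumptions, not the values actually used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the ~49K x 31 dataset (assumed imbalance of 90/10)
X, y = make_classification(n_samples=49000, n_features=31,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# Illustrative grid; the actual values searched are not given in the question
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid, scoring="roc_auc", cv=3,
)
search.fit(X_tr, y_tr)

best = search.best_estimator_
print("CV:   ", search.best_score_)
print("Train:", roc_auc_score(y_tr, best.predict_proba(X_tr)[:, 1]))
print("Test: ", roc_auc_score(y_te, best.predict_proba(X_te)[:, 1]))
```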

Best Answer

You should plot the misclassification rate for the train and test sets, but my initial guess is that yes, you've overfitted the model.
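As an illustration, a minimal sketch of such a plot on synthetic data, sweeping max_depth; the data and the choice of sweep variable are assumptions for the example:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, smaller than the real set to keep the loop fast
X, y = make_classification(n_samples=5000, n_features=31, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

depths = range(2, 21, 2)
train_err, test_err = [], []
for d in depths:
    rf = RandomForestClassifier(max_depth=d, class_weight="balanced",
                                random_state=0).fit(X_tr, y_tr)
    train_err.append(1 - rf.score(X_tr, y_tr))  # misclassification rate
    test_err.append(1 - rf.score(X_te, y_te))

# A widening gap between the two curves as depth grows indicates overfitting
plt.plot(depths, train_err, label="train")
plt.plot(depths, test_err, label="test")
plt.xlabel("max_depth")
plt.ylabel("misclassification rate")
plt.legend()
plt.show()
```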

You could try tuning an XGBoost model with different learning rates, and experiment with different training subsample proportions and/or minimum split sizes. Also try using as simple a weak learner as possible (e.g. very shallow trees).
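A rough sketch of that tuning using xgboost's scikit-learn wrapper; the grid values and the choice of max_depth=2 as the weak learner are illustrative assumptions (min_child_weight plays a role analogous to a minimum split size):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = make_classification(n_samples=5000, n_features=31, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Shallow trees act as weak learners; vary the learning rate,
# the row-subsampling proportion, and the minimum child weight
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.6, 0.8, 1.0],
    "min_child_weight": [1, 5, 10],
}
search = GridSearchCV(
    XGBClassifier(max_depth=2, n_estimators=300, eval_metric="logloss"),
    param_grid, scoring="roc_auc", cv=3,
)
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)
```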