Solved – Random forest low score on testing data (scikit-learn)

hyperparameter, python, random forest, scikit-learn

I am trying to train my model using scikit-learn's random forest (regression) and have used GridSearchCV with cross-validation (CV=5) to tune the hyperparameters. I fixed n_estimators=2000 for all cases. Below are the searches that I performed.

  1. max_features: [1, 3, 5], max_depth: [1, 5, 10, 15], min_samples_split: [2, 6, 8, 10], bootstrap: [True, False]

The best were max_features=5, max_depth=15, min_samples_split=10, bootstrap=True

Best score = 0.8724
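
For reference, each search was set up roughly like this (X_train and y_train stand for my training data, which is not shown):

```python
# Rough sketch of the first search; X_train, y_train are placeholders
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_features": [1, 3, 5],
    "max_depth": [1, 5, 10, 15],
    "min_samples_split": [2, 6, 8, 10],
    "bootstrap": [True, False],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=2000, random_state=0),
    param_grid,
    cv=5,        # 5-fold cross-validation
    n_jobs=-1,   # use all cores
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```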

Then I searched close to the parameters that were best;

  2. max_features: [3, 5, 6], max_depth: [10, 20, 30, 40], min_samples_split: [8, 16, 20, 24], bootstrap: [True, False]

The best were max_features=5, max_depth=30, min_samples_split=20, bootstrap=True

Best score = 0.8722

Again, I searched close to the parameters that were best;

  3. max_features: [2, 4, 6], max_depth: [25, 35, 40, 50], min_samples_split: [22, 28, 34, 40], bootstrap: [True, False]

The best were max_features=4, max_depth=25, min_samples_split=22, bootstrap=True

Best score = 0.8725

Then I ran a grid search over the best parameters found in the above runs, and the best combination was

max_features=4, max_depth=15, min_samples_split=10

Best score = 0.8729

Then I used these parameters to predict for an unknown dataset but got a very low score (around 0.72).
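
The final fit and test-set evaluation were roughly as follows (again, placeholder names; X_test and y_test are the 1416 unseen samples):

```python
# Sketch of the final fit and evaluation; all data names are placeholders
from sklearn.ensemble import RandomForestRegressor

final_model = RandomForestRegressor(
    n_estimators=2000, max_features=4, max_depth=15,
    min_samples_split=10, bootstrap=True, random_state=0,
)
final_model.fit(X_train, y_train)
print(final_model.score(X_test, y_test))  # R^2 on the unseen data; around 0.72 in my case
```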

My questions are:

  • Am I doing the hyperparameter tuning correctly, or am I missing something?

  • Why is my testing score so low compared to my training and validation scores, and how can I improve it so that I get good predictions out of my model?

Sorry if these are basic questions; I am new to scikit-learn and ML.

P.S.: The training (+ cross-validation) data has 26138 samples with 6 features/inputs (columns) and one output. The testing data has 1416 samples.

Best Answer

I guess you have tunnel-visioned on tuning too many hyperparameters that barely matter, simply because an easy-to-use grid-search function allowed you to do so.

Notice that all your explained variances differ only in the fourth digit. You have found what appear to be negligibly better model settings. But even that you cannot be sure of, because:

  • the RF model is non-deterministic, so performance will vary slightly from run to run
  • CV only estimates future model performance, and with limited precision
  • n-fold CV is not perfectly reproducible and should be repeated to increase precision
  • grid tuning should be performed with nested CV (see the sketch after this list), but I don't think that is your problem here.
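
A minimal sketch of what repeated and nested CV look like, using make_regression data as a stand-in for your own X and y, and a smaller forest than your 2000 trees just to keep it fast:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score

# Stand-in data to make the sketch self-contained; replace with your own X, y
X, y = make_regression(n_samples=2000, n_features=6, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=200, n_jobs=-1)

# Repeated 5-fold CV: the spread of these scores shows how (im)precise
# a single CV estimate is.
scores = cross_val_score(rf, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=0))
print(scores.mean(), scores.std())

# Nested CV: the inner grid search tunes, the outer CV estimates the
# performance of the whole tuning procedure without optimistic bias.
inner = GridSearchCV(rf, {"max_features": [2, 3, 4, 5, 6]}, cv=5)
print(cross_val_score(inner, X, y, cv=5).mean())
```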

Only "grid-tune" max_features. It has only 6 possoble values. You can run each 5 times and plot it. Check if some setting is repetitively better, probably you find anything from 2-4 perform fine. Max_depth is by default unlimited and that is optimal as long data is not very noisy. You set it to 25, which in practice is unlimited because already $2^{15}$=32000 and you "only" have 26000 samples. Changing these other hyper parameter will only give you shorter training times(useful) and/or more robust models. Thumb-rule: as explained variance is way above 50%, you do not need to make your model more robust by limiting depth of trees (max_depth, min_samples_split) to e.g. 3. Max_depth 15 is quite deep, and probably plenty deep enough, just as 2000 are trees enough. So raising and lowering number of trees and depth within the quite fine range does not change anything, and it will be really hard and non-rewarding to find the true best setting.

So what you have really learned from your grid searches is that RF gives essentially the same performance across the whole parameter space you tested.

If you obtain a test set from a different source, you should expect a drop in performance. Your CV only estimates the model performance for a future test set drawn from exactly the same population. And with ~1400 test samples, sampling error alone could swing the measured performance by roughly +/- 0.03, I would guess.
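
A quick way to see the size of that effect is to bootstrap the test-set metric. The sketch below uses simulated predictions of roughly the right size and quality as a stand-in; swap in your own test targets and model predictions:

```python
import numpy as np
from sklearn.metrics import r2_score

# Simulated stand-ins: a pretend test set of 1416 samples with R^2 around 0.75
rng = np.random.default_rng(0)
y_test = rng.normal(size=1416)
y_pred = y_test + rng.normal(scale=0.5, size=1416)

boot = []
for _ in range(2000):
    idx = rng.integers(0, 1416, 1416)          # resample the test set with replacement
    boot.append(r2_score(y_test[idx], y_pred[idx]))
print(np.percentile(boot, [2.5, 97.5]))        # spread caused by sampling error alone
```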

If you swapped to, e.g., boosting algorithms, grid-tuning multiple parameters would be a more rewarding exercise.
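
Roughly along these lines (stand-in data again; GradientBoostingRegressor is just one example of a boosting model):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in data; replace with your own X, y
X, y = make_regression(n_samples=2000, n_features=6, noise=10.0, random_state=0)

# For boosting, these parameters genuinely interact, so a grid search pays off
param_grid = {
    "learning_rate": [0.03, 0.1, 0.3],
    "max_depth": [2, 3, 5],
    "subsample": [0.5, 1.0],
}
search = GridSearchCV(GradientBoostingRegressor(n_estimators=500), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```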

To improve your model, maybe you can refine your features. Look at the variable importances to see which features work well; could you perhaps derive new features with an even higher variable importance? Since your explained variance is quite high (low noise), you may benefit from swapping to xgboost. You may also want to ask yourself whether this chase for better performance on some target, by some metric (explained variance), is actually useful for your purpose. Maybe you don't need the model to be that accurate when predicting large values, in which case you could, e.g., log-transform your target. Maybe you only want to rank your predictions, in which case explained variance could be replaced with the Spearman rank coefficient.
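
A sketch of how to look at importances and a rank metric, assuming final_model is your fitted forest and X_test, y_test, feature_names are placeholders for your test data and the 6 column names:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.inspection import permutation_importance

# Impurity-based importances come for free with the fitted forest
print(dict(zip(feature_names, final_model.feature_importances_)))

# Permutation importance on held-out data is usually more trustworthy
perm = permutation_importance(final_model, X_test, y_test, n_repeats=10, random_state=0)
print(dict(zip(feature_names, perm.importances_mean)))

# If large values matter less, you could train on a log-transformed target,
# e.g. y_log = np.log1p(y_train)

# If only the ranking matters, look at the Spearman rank correlation instead of R^2
rho, _ = spearmanr(final_model.predict(X_test), y_test)
print(rho)
```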

Happy modelling :)