Solved – Random Forest Learning Curve

cross-validation, machine learning, random forest, variance

I created the following learning curve in order to diagnose my Random Forest model.

[Figure: learning curve showing training and cross-validation scores against training set size]

As far as I can see, the curve indicates high variance and 'underfitting' (not overfitting), because the cross-validation error is much lower than the training error.

What are the general recommendations for improving a model like this? For example, I know that increasing the number of training examples might help resolve a high-variance issue (i.e., overfitting). Is that true in my case? And what if I can't get more training data?
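For reference, a curve like this is typically produced with scikit-learn's learning_curve (an assumption; the question doesn't say which library was used). A minimal sketch, with synthetic data standing in for the actual dataset:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import learning_curve

    # Synthetic stand-in for the asker's data (hypothetical).
    X, y = make_regression(n_samples=2000, n_features=20, noise=10.0,
                           random_state=0)

    # Scores (not errors) at increasing training set sizes.
    train_sizes, train_scores, cv_scores = learning_curve(
        RandomForestRegressor(n_estimators=100, random_state=0),
        X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),
        cv=5,
        scoring="explained_variance",
    )

    print(train_sizes)
    print(train_scores.mean(axis=1))  # near 1.0: trees memorize training samples
    print(cv_scores.mean(axis=1))     # lower, and rising with more data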

Best Answer

I suspect you have trained a series of RF regression models and plotted explained variance (not error) against training set size. Explained variance is the opposite of an error metric: higher is better, and for any reasonable fit it lies between 0 and 1.

Secondly, it does not make much sense to diagnose the training (resubstitution) score of a random forest. Samples take the same paths through the trees during training and prediction, so a near-perfect fit is obtained by construction. That is why out-of-bag (OOB) training accuracy/error is used instead.
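A minimal sketch of reading the OOB score instead of the resubstitution score (assuming scikit-learn, where oob_score_ reports R² on the samples each tree never saw):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=2000, n_features=20, noise=10.0,
                           random_state=0)

    rf = RandomForestRegressor(n_estimators=500, oob_score=True,
                               random_state=0)
    rf.fit(X, y)

    print(rf.score(X, y))  # resubstitution score: near-perfect, not informative
    print(rf.oob_score_)   # out-of-bag R^2: an honest generalization estimate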

The cross-validated score increases a little because more samples both lower bias (deeper trees and a denser sampling of the data structure) and lower variance (decreased tree correlation and less sampling error).

So everything looks OK, and you are probably neither over- nor underfitting. I would prefer to (a) simply plot the OOB score against different hyperparameter settings, or (b) wrap the model in a repeated nested-CV grid search if you want to be really thorough; both are sketched below. You will probably find the default parameters are close to optimal.
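Both options can be sketched with scikit-learn (the parameter grid below is illustrative, not a recommendation):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = make_regression(n_samples=2000, n_features=20, noise=10.0,
                           random_state=0)

    # (a) OOB score across hyperparameter settings -- no extra CV loop
    # needed, since each tree is scored on samples left out of its bootstrap.
    for max_features in (0.25, 0.5, 0.75, 1.0):
        rf = RandomForestRegressor(
            n_estimators=500, max_features=max_features,
            oob_score=True, random_state=0,
        ).fit(X, y)
        print(f"max_features={max_features}: OOB R^2 = {rf.oob_score_:.3f}")

    # (b) Nested CV: the inner grid search tunes, the outer loop scores the
    # whole tuning procedure (repeat with different seeds to be thorough).
    param_grid = {"max_features": [0.25, 0.5, 1.0], "min_samples_leaf": [1, 5]}
    inner = GridSearchCV(
        RandomForestRegressor(n_estimators=300, random_state=0),
        param_grid, cv=3,
    )
    outer_scores = cross_val_score(inner, X, y, cv=5)
    print(outer_scores.mean(), outer_scores.std())

The OOB route is much cheaper because it reuses the bootstrap structure the forest already builds; nested CV additionally guards against the grid search itself overfitting the validation folds.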
