Solved – Is a negative OOB score possible with scikit-learn’s RandomForestRegressor

out-of-sample, random forest, scikit-learn

I'm currently using scikit-learn's RandomForestRegressor in Python, and I'm scratching my head over why I occasionally wind up with negative out-of-bag scores. As far as I can tell from the description of the `oob_score_` attribute ("Score of the training dataset obtained using an out-of-bag estimate") and everything else I've read, the out-of-bag score should be a positive value.

Extra info:

  • All of the scores I have been getting, both positive and negative, are very small in magnitude (< 0.001; many are < 0.0001). I'm not sure whether this is normal, but the response values are also small in general.

  • I'm using 500 trees, and varying min_samples_leaf and max_features. I seem to get the negative values when min_samples_leaf is over ~500.

  • There are about a hundred Boolean columns created to encode categorical data (so they are fairly sparsely populated). In contrast, there are about 10 other, numerical columns. Null values have been filled in with a large negative number as a numerical placeholder.

  • My data size is about 1,000,000 rows, with 65% being used for training data and the remainder for testing.

  • (Any other info I can give to help out?)

Is there a statistical interpretation/definition of the out-of-bag score for a random forest for which one would expect a negative score as a possibility, or is this more likely to be a quirk of the program?

Best Answer

RandomForestRegressor's `oob_score_` attribute is the score of the out-of-bag samples. scikit-learn uses "score" to mean "a measure of how good a model is", and what that measure is differs between models. For RandomForestRegressor (as for most regression models), it is the coefficient of determination $R^2$, as the documentation for the `score()` method states.

This is defined as $(1 - u/v)$, where $u$ is the model's sum of squared errors, $u = \sum_i (y_i - \hat{y}_i)^2$, and $v$ is the sum of squared errors of the best constant predictor (the mean), $v = \sum_i (y_i - \bar{y})^2$, with both sums ranging over the evaluation instances (here, the out-of-bag samples).
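To make the definition concrete, here is a small sketch (the toy numbers are made up for illustration) that computes $1 - u/v$ by hand and checks it against scikit-learn's `r2_score`, which implements the same formula:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 3.0, 3.0])  # a deliberately bad constant prediction

u = np.sum((y_true - y_pred) ** 2)          # model's sum of squared errors: 5.0
v = np.sum((y_true - y_true.mean()) ** 2)   # errors of predicting the mean: 2.0
r2_manual = 1 - u / v                       # 1 - 5/2 = -1.5: negative!

print(r2_manual, r2_score(y_true, y_pred))
```

Because the prediction here is worse than simply predicting the mean ($u > v$), the score comes out negative.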

This measure can indeed be negative if $u > v$, i.e. if your model is worse than the best constant predictor. That means your model is performing badly; useful models get positive scores. A score of around 0.0001 means your model is only just barely better than the best constant predictor.
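You can reproduce the situation from the question directly. In this sketch (synthetic data; the target is pure noise, independent of the features, so there is genuinely nothing to learn), a forest with a large `min_samples_leaf` produces an `oob_score_` hovering around zero and often slightly negative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(2000, 5)
y = rng.rand(2000)  # target is independent of X: no signal to find

model = RandomForestRegressor(
    n_estimators=100,
    oob_score=True,        # compute R^2 on the out-of-bag samples
    min_samples_leaf=500,  # large leaves, as in the question
    random_state=0,
)
model.fit(X, y)
print(model.oob_score_)  # near zero; can be slightly negative
```

With large leaves each tree is nearly constant, so the OOB predictions are close to (but not exactly) the mean of each tree's bootstrap sample; the small extra noise from bootstrapping is enough to push $u$ just above $v$ and the score just below zero.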