Solved – random forest OOB error increases as more trees are built

overfitting, random forest

I am implementing a random forest, and I find that when I build a single tree, the OOB error (mean squared error, since I am doing regression) is close to zero, while as more trees are built, the OOB error rises and then stabilizes. This is counter-intuitive, since the textbook teaches that the OOB error should decline as more trees are built. I compared my implementation with R's: my OOB error is slightly less than R's when ntree = 1000, but when ntree = 1, my OOB error is close to zero while R's is quite large.

Simply put, my OOB error increases as more trees are built and then stabilizes, whereas R and the textbook show the OOB error decreasing as more trees are built and then stabilizing.

So, is there anything wrong with my implementation? How should I tune my algorithm?
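For reference, here is a minimal sketch (Python with scikit-learn trees; the toy data and every name below are placeholders, not my actual code) of how I understand the OOB error is supposed to be computed: each tree is scored only on the rows left out of its bootstrap sample. (If a tree were instead scored on its own in-bag rows, a fully grown tree would fit them almost perfectly, which would make the single-tree error look close to zero.)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_mse(X, y, n_trees, seed=0):
    """OOB mean squared error of a bagged ensemble of regression trees."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pred_sum = np.zeros(n)   # running sum of OOB predictions per sample
    pred_cnt = np.zeros(n)   # number of trees for which each sample was OOB
    for b in range(n_trees):
        in_bag = rng.integers(0, n, size=n)        # bootstrap sample, with replacement
        oob = np.setdiff1d(np.arange(n), in_bag)   # rows this tree never saw
        tree = DecisionTreeRegressor(random_state=b).fit(X[in_bag], y[in_bag])
        pred_sum[oob] += tree.predict(X[oob])      # score ONLY on out-of-bag rows
        pred_cnt[oob] += 1
    seen = pred_cnt > 0  # with few trees, some rows are never OOB; skip them
    return np.mean((y[seen] - pred_sum[seen] / pred_cnt[seen]) ** 2)

# Toy data, purely to exercise the function.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)
for b in (1, 10, 100, 1000):
    print(f"ntree = {b:4d}: OOB MSE = {oob_mse(X, y, b):.3f}")
```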

Best Answer

Hastie et al. address this question very briefly in Elements of Statistical Learning (page 596).

Another claim is that random forests “cannot overfit” the data. It is certainly true that increasing $\mathcal{B}$ [the number of trees in the ensemble] does not cause the random forest sequence to overfit... However, this limit can overfit the data; the average of fully grown trees can result in too rich a model, and incur unnecessary variance. Segal (2004) demonstrates small gains in performance by controlling the depths of the individual trees grown in random forests. Our experience is that using full-grown trees seldom costs much, and results in one less tuning parameter.

Stated another way, for a fixed hyperparameter configuration, increasing the number of trees cannot overfit the data; however, the other hyperparameters (e.g., tree depth) might be a source of overfitting.
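As a quick empirical check (a sketch only; scikit-learn's RandomForestRegressor and the synthetic dataset below are my additions, not part of Hastie et al.'s text), growing a forest incrementally with warm_start=True shows the OOB MSE falling and then flattening as trees are added, rather than rising:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

# warm_start=True adds trees to the existing forest instead of refitting from scratch.
forest = RandomForestRegressor(oob_score=True, warm_start=True, bootstrap=True,
                               random_state=0)

for n_trees in (25, 50, 100, 200, 400, 800):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    # oob_prediction_ holds each sample's prediction averaged over the trees
    # for which that sample was out of bag.
    mse = np.mean((y - forest.oob_prediction_) ** 2)
    print(f"ntree = {n_trees:4d}: OOB MSE = {mse:.1f}")
```

The level at which the curve flattens is set by the remaining hyperparameters (tree depth, mtry, and so on), which is where any overfitting would come from.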