Random Forest – Improving OOB Error Estimate by Decreasing the Number of Features

classification, machine learning, r, random forest

I am applying a random forest algorithm as a classifier on a microarray dataset that is split into two known groups and has thousands of features. After the initial run I look at the importance of the features and run the algorithm again with the 5, 10 and 20 most important features. I find that for all features, and for the top 10 and top 20, the OOB estimate of error rate is 1.19%, whereas for the top 5 features it is 0%. This seems counter-intuitive to me, so I was wondering whether you could explain whether I am missing something or using the wrong metric.

I am using the randomForest package in R with ntree=1000, nodesize=1 and mtry=sqrt(n).
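For concreteness, a minimal sketch of that workflow, assuming the predictors are in a matrix `x` with column names and the group labels in a factor `y` (these names are placeholders, not from the original post):

```r
library(randomForest)

set.seed(1)
rf_all <- randomForest(x, y, ntree = 1000, nodesize = 1,
                       mtry = floor(sqrt(ncol(x))))
rf_all$err.rate[1000, "OOB"]          # OOB error rate using all features

## Rank features by importance and refit on the top 5 only
imp  <- importance(rf_all)[, "MeanDecreaseGini"]
top5 <- order(imp, decreasing = TRUE)[1:5]
rf_top5 <- randomForest(x[, top5, drop = FALSE], y, ntree = 1000,
                        nodesize = 1, mtry = floor(sqrt(5)))
rf_top5$err.rate[1000, "OOB"]         # the suspiciously low estimate
```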

Best Answer

This is feature-selection overfitting, and it is well known -- see Ambroise & McLachlan (2002). The problem stems from two facts: RF is very flexible and the number of objects is small. With few objects, it is easy for a randomly generated attribute to correlate well with the decision by chance. And when the number of attributes is large, you can be certain that some of the totally irrelevant ones will be very good predictors, even good enough to form a cluster that can recreate the decision with 100% accuracy, especially given the huge flexibility of RF. So it becomes obvious that, when instructed to find the best possible subset of attributes, the FS procedure will find this cluster.
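A quick simulation (my own illustration of the point, not from the original answer) makes the bias visible: with labels that are pure noise relative to the predictors, the full-model OOB error is near 50%, yet refitting on the top-ranked features drives it far lower, purely because the selection looked at the same data.

```r
library(randomForest)

set.seed(42)
n <- 40; p <- 1000
x <- matrix(rnorm(n * p), n, p)
y <- factor(rep(c("A", "B"), each = n / 2))   # labels unrelated to x

rf_all <- randomForest(x, y, ntree = 1000)
rf_all$err.rate[1000, "OOB"]                  # close to 0.5, as expected for noise

## Select the "best" 5 features using importance from the same data, then refit
top5 <- order(importance(rf_all)[, "MeanDecreaseGini"], decreasing = TRUE)[1:5]
rf_top5 <- randomForest(x[, top5], y, ntree = 1000)
rf_top5$err.rate[1000, "OOB"]                 # much lower, although nothing is
                                              # actually predictive -- pure
                                              # selection bias
```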
One solution (cross-validation) is given in Ambroise & McLachlan; you can also try our approach to the topic, the Boruta algorithm. It extends the attribute set with "shadow attributes" that are random by design and compares their RF importance to that obtained for the real attributes, in order to judge which of the real ones are indeed random and can be removed; this is replicated many times to obtain significance. Boruta is intended for a somewhat different task, but as far as my tests showed, the resulting set is free of the FS overfitting problem.
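A minimal sketch of running Boruta on this kind of data (the Boruta package is on CRAN; `x` and `y` are the same placeholders as above):

```r
library(Boruta)

set.seed(7)
bor <- Boruta(x, y, doTrace = 0)   # adds shadow attributes and compares importances
print(bor)                         # summary of Confirmed / Tentative / Rejected attributes
getSelectedAttributes(bor)         # names of attributes judged genuinely relevant
```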