Random Forest – Best Practices for Presenting a Random Forest Model in a Publication

Tags: classification, machine learning, microarray, r, random forest

I am using the random forest algorithm as a robust classifier of two groups in a microarray study with thousands of features.

  • What is the best way to present the random forest so that there is enough information to make it
    reproducible in a paper?
  • Is there a plot method in R to actually plot the tree, if there are a small number of features?
  • Is the OOB estimate of error rate the best statistic to quote?

Best Answer

Regarding reproducibility, the best approach is to practice reproducible research: publish the code and data along with the paper, either on your own website or on a hosting site such as GitHub.
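As a rough illustration of what "reproducible" means for a random forest in particular, the sketch below fixes the random seed and prints the settings a reader would need in order to refit the model. The seed value, the use of the iris data, and the names fit1 and fit2 are assumptions made for this example only; they are not part of the original question.

```r
library(randomForest)

# Fix the RNG seed so the bootstrap samples and feature subsets are repeatable.
set.seed(42)  # arbitrary value; report whichever seed you actually use
fit1 <- randomForest(Species ~ ., data = iris, ntree = 500)

set.seed(42)
fit2 <- randomForest(Species ~ ., data = iris, ntree = 500)

# With the same seed, data, and settings the OOB predictions should match.
identical(fit1$predicted, fit2$predicted)

# Settings worth stating in the methods section.
fit1$ntree                      # number of trees
fit1$mtry                       # variables tried at each split
packageVersion("randomForest")  # package version
R.version.string                # R version
```

Reporting the seed, ntree, mtry, and the software versions alongside the shared code and data should be enough for someone else to rebuild the same forest.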

Regarding visualization, Leo Breiman has done some interesting work on this (see his homepage, in particular the section on graphics).

But if you're using R, then the randomForest package has some useful functions:

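library(randomForest)
set.seed(1)  # make the forest repeatable
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000, keep.forest=FALSE,
                           importance=TRUE)
plot(mtcars.rf, log="y")   # OOB error against the number of trees
varImpPlot(mtcars.rf)      # variable importance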
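The numbers behind these two plots can also be read directly from the fitted object, which is handy if you want to quote them in the text or in a table. The following is a minimal sketch on the same mtcars regression example; the variable names oob_mse, final_oob_mse, and imp are made up for illustration.

```r
library(randomForest)
set.seed(1)
mtcars.rf <- randomForest(mpg ~ ., data = mtcars, ntree = 1000,
                          importance = TRUE)

# For a regression forest, $mse holds the OOB mean squared error
# after 1, 2, ..., ntree trees (the curve drawn by plot()).
oob_mse <- mtcars.rf$mse
final_oob_mse <- oob_mse[mtcars.rf$ntree]
final_oob_mse

# Importance scores behind varImpPlot(), sorted by permutation importance.
imp <- importance(mtcars.rf)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
```

For a classification forest the analogous component is err.rate, whose "OOB" column gives the OOB error rate as trees are added.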

The proximity matrix of a classification forest can be visualized with multidimensional scaling:

set.seed(1)
data(iris)
iris.rf <- randomForest(Species ~ ., iris, proximity=TRUE,
                        keep.forest=FALSE)
MDSplot(iris.rf, iris$Species)
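If you need more control over the figure for a paper (axis labels, legend, point styles), a similar plot can be built by hand from the proximity matrix. This is a sketch under the assumption that treating 1 - proximity as a distance and applying classical MDS is an acceptable stand-in for MDSplot; the variable name coords and the plotting choices are illustrative.

```r
library(randomForest)
set.seed(1)
iris.rf <- randomForest(Species ~ ., iris, proximity = TRUE,
                        keep.forest = FALSE)

# Proximity measures how often two samples end up in the same terminal node.
# Treat 1 - proximity as a distance and apply classical MDS in two dimensions.
coords <- cmdscale(1 - iris.rf$proximity, k = 2)

plot(coords, col = as.integer(iris$Species), pch = 19,
     xlab = "Dim 1", ylab = "Dim 2")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```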

I'm not aware of a simple way to plot an individual tree, but the getTree function returns the structure of a single tree as a table (one row per node), which you can then plot separately.

getTree(randomForest(iris[,-5], iris[,5], ntree=10), 3, labelVar=TRUE)
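To give an idea of what getTree returns, here is a small sketch that summarizes one tree instead of plotting it. It relies on the documented column names of the getTree output ("status", "split var", "prediction") and on status being -1 for terminal nodes; the names rf, tree, is_leaf, and n_leaves are made up for the example.

```r
library(randomForest)
set.seed(1)
rf <- randomForest(iris[, -5], iris[, 5], ntree = 10)

# One row per node: daughters, split variable, split point, status, prediction.
tree <- getTree(rf, k = 3, labelVar = TRUE)
head(tree)

# Terminal nodes are flagged with status == -1.
is_leaf <- tree[["status"]] == -1
n_leaves <- sum(is_leaf)
n_leaves

# Which variables the internal nodes split on, and what the leaves predict.
table(droplevels(tree[["split var"]][!is_leaf]))
table(tree[["prediction"]][is_leaf])
```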

The Strobl/Zeileis presentation on "Why and how to use random forest variable importance measures (and how you shouldn't)" has examples of trees that must have been produced in this way. This blog post on tree models has some nice examples of CART tree plots that you could use as a model.

As @chl commented, a single tree isn't especially meaningful in this context, so short of using it to explain what a random forest is, I wouldn't include this in a paper.