Regarding making it reproducible, the best way is to provide reproducible research (i.e. code and data) along with the paper. Make it available on your website, or on a hosting site (like github).
Regarding visualization, Leo Breiman has done some interesting work on this (see his homepage, in particular the section on graphics).
But if you're using R, the randomForest package has some useful functions:
library(randomForest)
set.seed(1)  # for reproducibility
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000, keep.forest=FALSE,
                          importance=TRUE)
plot(mtcars.rf, log="y")
varImpPlot(mtcars.rf)
And
set.seed(1)
data(iris)
iris.rf <- randomForest(Species ~ ., iris, proximity=TRUE,
keep.forest=FALSE)
MDSplot(iris.rf, iris$Species)
I'm not aware of a simple way to actually plot a tree, but you can use the getTree
function to retrieve the tree and plot that separately.
getTree(randomForest(iris[,-5], iris[,5], ntree=10), 3, labelVar=TRUE)
The Strobl/Zeileis presentation on "Why and how to use random forest variable importance measures (and how you shouldn't)" has examples of trees which must have been produced in this way. This blog post on tree models has some nice examples of CART tree plots which you can use as models.
As @chl commented, a single tree isn't especially meaningful in this context, so short of using it to explain what a random forest is, I wouldn't include this in a paper.
Hastie et al. address this question very briefly in Elements of Statistical Learning (page 596).
Another claim is that random forests “cannot overfit” the data. It is certainly true that increasing $\mathcal{B}$ [the number of trees in the ensemble] does not cause the random forest sequence to overfit... However, this limit can overfit the data; the average of fully grown trees can result in too rich a model, and incur unnecessary variance. Segal (2004) demonstrates small gains in performance by controlling the depths of the individual trees grown in random forests. Our experience is that using full-grown trees seldom costs much, and results in one less tuning parameter.
Stated another way: for a fixed hyperparameter configuration, increasing the number of trees cannot cause overfitting; the other hyperparameters, however, can still be a source of overfitting.
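A quick sketch of the point above (assuming the randomForest package is installed): the OOB error plateaus as ntree grows rather than climbing back up, while the complexity of the individual trees is controlled separately, e.g. via the nodesize parameter.

```r
# Sketch: more trees does not cause overfitting; tree depth is a
# separate knob (here via nodesize, the minimum terminal-node size).
library(randomForest)
set.seed(1)
data(iris)

# Few trees vs. many trees: OOB error stabilizes, it does not blow up
rf_small <- randomForest(Species ~ ., iris, ntree = 50)
rf_large <- randomForest(Species ~ ., iris, ntree = 2000)

# Shallower trees via a larger terminal-node size
rf_shallow <- randomForest(Species ~ ., iris, ntree = 500, nodesize = 20)

# Compare final OOB error rates (err.rate has one row per tree;
# the "OOB" column is the cumulative out-of-bag error)
tail(rf_small$err.rate[, "OOB"], 1)
tail(rf_large$err.rate[, "OOB"], 1)
tail(rf_shallow$err.rate[, "OOB"], 1)
```

On a dataset like iris all three error rates come out similar, which is Hastie et al.'s point: full-grown trees "seldom cost much", but depth is still a tunable parameter if variance becomes a problem.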
Best Answer
This is feature-selection overfitting, and it is well known -- see Ambroise & McLachlan (2002). The problem stems from two facts: RF is very flexible, and the number of objects is small. With few objects, it is easy for a randomly generated attribute to correlate well with the decision by chance. And when the number of attributes is large, you can be almost certain that some of the totally irrelevant ones will be very good predictors -- even enough to form a cluster that can reproduce the decision with 100% accuracy, especially given the huge flexibility of RF. So it is no surprise that, when instructed to find the best possible subset of attributes, the FS procedure finds exactly this cluster.
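A toy simulation (base R only; the sample sizes are illustrative, not from the original post) makes the mechanism concrete: with few objects and many pure-noise attributes, the attributes most correlated with the class look predictive even though every one of them is random.

```r
# Few objects, many pure-noise attributes: selection bias in action.
set.seed(42)
n <- 20     # small number of objects
p <- 1000   # large number of irrelevant attributes
X <- matrix(rnorm(n * p), n, p)
y <- factor(rep(c("A", "B"), each = n / 2))  # arbitrary class labels

# Score every noise attribute by its correlation with the class
scores <- apply(X, 2, function(x) abs(cor(x, as.numeric(y))))

# The best-scoring noise attributes look like strong predictors
top <- order(scores, decreasing = TRUE)[1:10]
max(scores)  # spuriously high correlation from pure noise
```

A model refit on the `top` subset will look deceptively accurate; the honest fix, as in Ambroise & McLachlan, is to repeat the selection step inside each cross-validation fold rather than selecting once on the full data.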
One solution (cross-validating the whole selection procedure) is given in Ambroise & McLachlan; you can also try our approach to the topic, the Boruta algorithm. It extends the attribute set with "shadow attributes" that are random by design, and compares the RF importance of each real attribute with that obtained for the shadows to judge which real attributes are genuinely informative and which are random and can be removed; the comparison is replicated many times to reach significance. Boruta is intended for a somewhat different task, but as far as my tests showed, the resulting set is free of the FS-overfitting problem.
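A minimal Boruta run looks like this (assuming the Boruta package is installed; the dataset is just a placeholder):

```r
# Boruta: each real attribute competes against permuted "shadow" copies;
# an attribute is Confirmed only if its RF importance significantly
# beats the best shadow, over many replicated runs.
library(Boruta)
set.seed(1)
data(iris)
bor <- Boruta(Species ~ ., data = iris)
print(bor)      # decisions: Confirmed / Tentative / Rejected
attStats(bor)   # per-attribute importance statistics
```

On iris all four attributes are informative, so expect them all to be Confirmed; on noisy data the Rejected set is what you would drop before fitting the final model.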