Solved – method to plot the output of a random forest in R

data visualizationmodelingrrandom forestregression

Nice and simple. I've spent two hours googling, reading cross validated, and several r blogs to attempt to find a simple method of outputting the representative tree in R.

I was attempting to demonstrate to a coworker that random forest was producing better results (better in accuracy and more reproducible) than his linear regression on the same data set. However he ultimately said it didn't matter because he'd given up trying to explain it to his superior. His boss wants to use the linear regression model because he can in-turn explain it to his superiors. While essentially they have to trust the output of the random forest.

I recall that it's possible to display a tree producted by a CART model, and in my googling I found you can also simply call plot() on the output of ctree from the cforest package. However I can't seem to find a way to plot the output of randomforest (or cforest) in the same fashion.

Is there a way to do this? Or alternatively is there a known way to extract the tree from the forest to plot using the existing tools?

Best Answer

There are many packages that implement randomForest. Party is one of them that supports plotting

First build a forest:

library("party")
cf <- cforest(Species~., data=iris)

Then extract a tree and build a binary tree that can be plotted:

pt <- prettytree(cf@ensemble[[1]], names(cf@data@get("input"))) 
nt <- new("BinaryTree") 
nt@tree <- pt 
nt@data <- cf@data 
nt@responses <- cf@responses 

plot(nt, type="simple")

Related Solutions

Solved – Random Forest partial plot

Something like that would be my starting assumption, and for many practical examples you would be unlucky, if it turned out to be very wrong. But...

Noise: The more noise, the more conservative predictions(regression towards the mean) the RF will yield. This will introduce a bias, generally reducing the amplitude/steapness of a given partial plot. This should be regarded as a feature, not a bug. Thus the upper flatness, can also be due to few samples and more noise.

Interactions: Partial plotting of the higher dimensional topology of the trained RF model, is suitable only, when there is no dominant interactions with this specific variable. In the extreme case a variable can be highly important, but have a near flat partial function or you could end up with a Simpsons Paradox http://en.wikipedia.org/wiki/Simpson%27s_paradox.

Sample density: Alternatively you could more crudely say overall that y = a log(x) + b . I would recommend to plot an overlay of the training samples. Otherwise it is hard to assess weather a given local 'blop' is most likely due to few samples and some noise or it is actually a sound trend, which deserves to be described in detail.

Did the model use the specific variable much?: If the variable importance of this variable is very low, that would often mean that this variable have not been used much in the trees of the forest. Therefore the reproducibility of the partial function could become more unstable and the pratial function could become more crude. This could happen for noisy environments, sparse environments. It helps a little to lower mtry, such that less superior variables are used more.

Lastly a link to similar question I answered with some code examples for R randomForest: R: What do I see in partial dependence plots of gbm and RandomForest?

Solved – Why Decision tree is outperforming Random Forest in this simple case

In addition to @mariodeng's answer which explains why the random forest trained with default parameters is worse here, here's an explanation why it may not be better than single trees in your experiment anyways:

Aggregated/ensemble models are not universally better than their "single" counterparts, they are better if and only if the single models suffer of instability.

With 1000 training rows and only 3 columns, you are in a comfortable training sample size situation in which even a decision tree may get reasonably stable. (For 3d data you can easily check the variation you have in the assignment of input space to the classes when rerunning the experiment.)

If the predictions of the trees are stable, all submodels in the ensemble return the same prediction and then the prediction of the random forest is just the same as the prediction of each single tree.
So then not only will the overall performance be the same, it will be the same cases that are predicted correctly and wrongly, respectively.

This is the case in your example:

table (predict(dtFit, test) [, 2], predict (rfFit, test))

#     0  1
#  0 46  0
#  1  0 54

why not 100% accurate?

You train on data that is not representative for the test cases: the test cases cover regions of the input space that never appear in the training data. There is no way for a model to know which class (if any - or maybe a 3rd? ...) cases far outside training space should belong to.

Particularly for highly nonlinear partitioning models (such as the decision trees), leaving training space will typically rather sooner than later lead to disaster.

If you plan to train on one class only, you need to look into so-called one-class classifiers which try to establish independent boundaries for each class. One-class classification of your toy data should give you the result that the out-of-training-space cases do not belong to any of the known classes.

Decision trees are a partitioning method, they cannot do one-class classification.