Solved – MCMC sampling of decision tree space vs. random forest

cart, markov-chain-montecarlo, monte-carlo, random-forest

A random forest is a collection of decision trees, each built by randomly restricting the features considered when splitting (and usually by bagging the training data). They apparently learn and generalize well. Has anybody done MCMC sampling of the decision tree space, or compared such samplers to random forests? I know it might be computationally more expensive to run the MCMC and save all the sampled trees, but I am interested in the theoretical features of this model, not the computational costs. What I mean is something like this:

  1. Construct a random decision tree (it would probably perform horribly).
  2. Compute the (unnormalized) posterior of the tree, $P(Tree \mid Data) \propto P(Data \mid Tree)$, or perhaps with a prior term, $P(Data \mid Tree)\,P_{prior}(Tree)$.
  3. Propose a random change to the tree and accept or reject it based on the posterior $P(Tree \mid Data)$.
  4. Every N steps, save a copy of the current tree.
  5. Go back to step 3, for some large $N \times M$ total steps.
  6. Use the collection of M saved trees to make predictions (a code sketch follows the list).
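
Here is a minimal sketch of steps 1–6 in Python, under loud assumptions: fixed-depth regression trees, a Gaussian likelihood for $P(Data \mid Tree)$, a flat prior $P_{prior}(Tree)$, and only two symmetric proposal moves (jitter one threshold or one leaf value). All names (`Node`, `mcmc_trees`, ...) are illustrative; real Bayesian tree samplers also use grow/prune/swap moves.

```python
# Minimal Metropolis sampler over a fixed-shape decision-tree space.
# Assumptions (not from the question): Gaussian likelihood, flat prior,
# symmetric proposals, fixed tree depth. Illustrative only.
import copy
import math
import random

import numpy as np


class Node:
    """Binary regression-tree node; a leaf carries a predicted mean."""

    def __init__(self, feature=None, threshold=None, left=None, right=None, value=0.0):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

    def is_leaf(self):
        return self.left is None

    def predict(self, x):
        if self.is_leaf():
            return self.value
        child = self.left if x[self.feature] <= self.threshold else self.right
        return child.predict(x)


def random_tree(X, depth=2):
    """Step 1: a random tree (it will typically fit the data horribly)."""
    if depth == 0:
        return Node(value=random.gauss(0.0, 1.0))
    f = random.randrange(X.shape[1])
    t = random.choice(X[:, f])  # random split point drawn from the data
    return Node(f, t, random_tree(X, depth - 1), random_tree(X, depth - 1))


def log_likelihood(tree, X, y, sigma=1.0):
    """Step 2: log P(Data | Tree) under an assumed Gaussian noise model."""
    preds = np.array([tree.predict(x) for x in X])
    return -0.5 * np.sum((y - preds) ** 2) / sigma**2


def all_nodes(node):
    """Collect every node so a proposal can pick one uniformly."""
    if node.is_leaf():
        return [node]
    return [node] + all_nodes(node.left) + all_nodes(node.right)


def propose(tree, X):
    """Step 3 (proposal): randomly jitter one threshold or one leaf value.
    Both moves are symmetric, so the Metropolis ratio needs no correction."""
    new = copy.deepcopy(tree)
    node = random.choice(all_nodes(new))
    if node.is_leaf():
        node.value += random.gauss(0.0, 0.5)
    else:
        node.threshold = random.choice(X[:, node.feature])
    return new


def mcmc_trees(X, y, n_steps=5000, thin=100):
    """Steps 3-5: Metropolis accept/reject; save every `thin`-th tree."""
    tree = random_tree(X)
    ll = log_likelihood(tree, X, y)
    saved = []
    for step in range(n_steps):
        cand = propose(tree, X)
        cand_ll = log_likelihood(cand, X, y)
        # Flat prior assumed, so the acceptance probability is just the
        # likelihood ratio min(1, P(Data|cand) / P(Data|tree)).
        if random.random() < math.exp(min(0.0, cand_ll - ll)):
            tree, ll = cand, cand_ll
        if (step + 1) % thin == 0:
            saved.append(copy.deepcopy(tree))
    return saved


def predict(saved, x):
    """Step 6: average the predictions of the M saved trees."""
    return float(np.mean([t.predict(x) for t in saved]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-2.0, 2.0, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, size=200)
    trees = mcmc_trees(X, y)
    print(predict(trees, np.array([1.0])))  # posterior-mean prediction at x = 1
```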

Would this give performance similar to random forests? Note that, unlike random forests, here we are not throwing away good data or features at any step. A toy empirical comparison is sketched below.
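
One way to probe this empirically (a toy check, not an answer to the general question) is to fit both methods on the same simulated data, reusing `mcmc_trees` and `predict` from the sketch above; scikit-learn's `RandomForestRegressor` and the sin-curve data set are assumed choices:

```python
# Toy comparison on simulated data; reuses mcmc_trees/predict from the
# sketch above. Results on one small data set say nothing general.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, size=300)
X_test = rng.uniform(-2.0, 2.0, size=(100, 1))
y_test = np.sin(X_test[:, 0])

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
rf_mse = np.mean((forest.predict(X_test) - y_test) ** 2)

saved = mcmc_trees(X, y, n_steps=5000, thin=100)
mcmc_mse = np.mean([(predict(saved, x) - t) ** 2 for x, t in zip(X_test, y_test)])

print(f"random forest test MSE: {rf_mse:.4f}")
print(f"MCMC tree ensemble test MSE: {mcmc_mse:.4f}")
```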

Best Answer

This was done some 13 years ago by Chipman, George and McCulloch (1998, JASA). Of course, a huge literature on Bayesian regression trees has grown out of this idea.