Solved – Difference between Random Forest and MART

algorithms, random forest, regression

I am wondering what the difference is between Random Forests and MART (multiple additive regression trees).

I have read a few articles, e.g.:

[1] L. Breiman. Random Forests. Machine Learning 45 (1): 5–32, 2001.
[2] J.H. Friedman. Greedy function approximation: A gradient boosting machine. Technical Report, IMS Reitz Lecture, Stanford, 1999; see also Annals of Statistics, 2001.

… and a few more.

And I understood the basics.

Random Forest and MART can both be used for either regression or
classification. Random Forest is based on bagging, randomization, and
voting, while the output of MART is a weighted sum of an ensemble of
boosted regression trees.
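To make sure I am reading the papers right, here is roughly how I picture the two in code. This is only a scikit-learn sketch of my own; the estimator names and parameters are my assumption, not anything taken from the articles above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

# Random Forest: bagging + feature randomization, prediction = average of the trees
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# MART / gradient boosted trees: prediction = shrunken (weighted) sum of small trees
gbt = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0).fit(X, y)
```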

So I have a basic idea of what the difference is.

However, it would help me a lot if somebody more experienced could sum it up.

And one more specific question: if I do regression using a Random Forest, what kind of tree is used? Is it an ensemble of regression trees? Or is it up to me which kind of tree to choose? Or is there a special kind of tree, specific to Random Forests?

Best Answer

First question

When you look at the prediction error through the bias-variance decomposition, you face the well-known compromise between how well you fit on average (bias) and how stable your prediction model is (variance).

One can try to improve the error by decreasing the variance or by decreasing the bias. Whichever route you choose, you may lose something on the other side. Still, one can hope that overall the prediction error improves.
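For squared-error regression, the decomposition I have in mind is the standard one (sketched here in my own notation, with $f$ the true function, $\hat f$ the fitted model and $\sigma^2$ the irreducible noise):

$$
\mathbb{E}\big[(y-\hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)]-f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{variance}}
+ \sigma^2 .
$$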

Random Forests use trees which, on their own, usually overfit. The trees used in RF are fully grown (no pruning). Their bias is small, since they capture all the local detail. However, fully grown trees have little predictive power on their own, due to high variance. The variance is high because each tree depends too much on the particular data it saw: adding or removing a single observation can give a totally different tree.

Bagging (averaging many trees fit on bootstrap samples) has the nice effect of producing smoother surfaces (the RF is still a piecewise-constant model, but the regions produced by averaging are much smaller). The surfaces resulting from averaging are smoother and more stable, and thus have less variance. One can relate this averaging to the Central Limit Theorem, where the variance of an average decreases with the number of terms.
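If you want to see this effect numerically, here is a small toy experiment (my own illustration with scikit-learn, not something from Breiman's paper): across re-drawn training sets, the predictions of a single fully-grown tree vary much more than the predictions of an average of bootstrap trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

x_test = np.linspace(0, 1, 50).reshape(-1, 1)
single_preds, bagged_preds = [], []

for seed in range(30):                                   # 30 independently drawn training sets
    rs = np.random.RandomState(seed)
    X = rs.uniform(0, 1, (200, 1))
    y = np.sin(4 * X[:, 0]) + rs.normal(scale=0.3, size=200)

    tree = DecisionTreeRegressor(random_state=0).fit(X, y)       # one fully-grown tree
    bag = BaggingRegressor(DecisionTreeRegressor(),              # average of 100 bootstrap trees
                           n_estimators=100, random_state=0).fit(X, y)

    single_preds.append(tree.predict(x_test))
    bagged_preds.append(bag.predict(x_test))

# prediction variance across training sets, averaged over the test grid
print("single fully-grown tree:", np.var(single_preds, axis=0).mean())
print("bagged trees (average): ", np.var(bagged_preds, axis=0).mean())
```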

A final note on RF: the averaging works best when the resulting trees are independent. Breiman realized that randomly choosing a subset of the input features at each split increases the independence between the trees, with the nice effect that the variance decreases even further.
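Both ingredients fit in one standard identity: for $B$ identically distributed trees, each with variance $\sigma^2$ and pairwise correlation $\rho$,

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
$$

Averaging (large $B$) removes the second term; the random feature selection lowers $\rho$, which is the only way to attack the first term.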

Boosting works the other way around with respect to the bias-variance decomposition. Boosting starts with a weak model, which has low variance and high bias. The trees used in boosting are not grown to many nodes, only to a small depth; sometimes it works fine even with decision stumps, the simplest form of tree.

So boosting starts with a low-variance, high-bias tree (the learners are weak because they do not capture the structure of the data well) and gradually works to reduce the bias. MART (I usually prefer the modern name, Gradient Boosted Trees) reduces bias by "walking" on the error surface in the direction given by the gradients. Along the way it usually loses some ground on variance, but, again, one can hope that this loss in variance is well compensated by the gain in the bias term.
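For squared error this "walking along the gradient" is very concrete, because the negative gradient of the loss at each training point is simply the current residual. Here is a minimal sketch of the idea (my own illustration, using decision stumps as weak learners and an arbitrary shrinkage value, not Friedman's exact algorithm):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, (300, 1))
y = np.sin(6 * X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1                 # shrinkage: each tree contributes only a small step
F = np.full_like(y, y.mean())       # start from a constant model: high bias, low variance
trees = []

for m in range(200):
    residual = y - F                                              # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)   # weak learner fit to the gradient
    F += learning_rate * stump.predict(X)                         # small step "downhill" on the error surface
    trees.append(stump)

# the final model is a shrunken sum of all the small trees plus the constant
def predict(X_new):
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)
```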

In conclusion, these two families (bagging and boosting) are not at all the same. They start their search for the best compromise from opposite ends and work in opposite directions.

Second question

Random Forests use CART trees, if you want to do it by the book. However, only very small differences have been reported when one uses different trees (like C4.5) with various splitting criteria. As a matter of fact, Weka's implementation of RF uses C4.5 (called J48 in the Weka library), probably because of computation time: the trees produced are smaller.
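If you want to see it concretely in one implementation: in scikit-learn (a CART-style library; this snippet only illustrates that particular package, not RF in general), a Random Forest used for regression is literally an ensemble of regression trees, and the splitting criterion is exposed as a parameter:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
print(type(rf.estimators_[0]).__name__)   # DecisionTreeRegressor: CART-style regression trees

# the splitting criterion is just a parameter (in recent scikit-learn versions),
# e.g. absolute error instead of the default squared error
rf_mae = RandomForestRegressor(n_estimators=10, criterion="absolute_error",
                               random_state=0).fit(X, y)
```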