Solved – Classification & Regression Trees (CART)

Most CART examples I encounter split the sample into a training set and a test set. This split-sample method strikes me as terribly inefficient unless you have a large sample. Are there alternatives to it if you want to use CART? Or is one better off using logistic regression followed by bootstrapping to assess model over-optimism (as discussed by Harrell)?
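
For concreteness, here is a minimal sketch of one such alternative: k-fold cross-validation of a single CART, so every observation is used for both fitting and validation. The rpart package and the built-in iris data are used purely for illustration.

    library(rpart)

    set.seed(42)
    k <- 10
    folds <- sample(rep(1:k, length.out = nrow(iris)))  # random fold assignment

    cv_error <- sapply(1:k, function(i) {
      fit  <- rpart(Species ~ ., data = iris[folds != i, ])     # fit on k-1 folds
      pred <- predict(fit, iris[folds == i, ], type = "class")  # predict held-out fold
      mean(pred != iris$Species[folds == i])                    # fold error rate
    })

    mean(cv_error)  # cross-validated estimate of the misclassification rate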

Best Answer

You are likely looking for Random Forests. RFs build many, many trees (that's the "forest" part), each grown on a bootstrap sample of the data (bagging) and considering only a random subset of the attributes at each split (that's the "random" part).

This reduces the overfitting that single CARTs are prone to; in fact, RFs handle it so well that they typically require no pruning at all (which single CARTs do). Since each split considers only a small, randomly chosen subset of the available attributes (e.g., five of them), the trees are also very fast to grow. Finally, the out-of-bag (OOB) error gives you an honest estimate of the likely prediction error without setting aside a separate test sample.
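
To make the mechanism concrete, here is a toy sketch assuming the rpart package and the iris data. Note that a real RF re-draws the attribute subset at every split, whereas this simplified version draws it once per tree (closer to Ho's random subspace method); the tree count of 100 and m = 2 are arbitrary illustrative choices.

    library(rpart)

    set.seed(1)
    grow_one_tree <- function(data, response, m) {
      rows  <- sample(nrow(data), replace = TRUE)         # bootstrap the rows (bagging)
      preds <- sample(setdiff(names(data), response), m)  # random attribute subset
      rpart(reformulate(preds, response), data = data[rows, ])
    }

    forest <- replicate(100, grow_one_tree(iris, "Species", m = 2), simplify = FALSE)

    # Predict a new case by majority vote over all trees in the forest
    votes <- sapply(forest, function(tree)
      as.character(predict(tree, newdata = iris[1, ], type = "class")))
    names(which.max(table(votes)))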

For R, there is the appropriately named randomForest package on CRAN. You may also want to look through questions tagged [random-forest].
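
A minimal example of fitting one, again using the iris data purely for illustration; the OOB error it reports addresses the original question about avoiding a separate test sample.

    library(randomForest)

    set.seed(1)
    rf <- randomForest(Species ~ ., data = iris, ntree = 500)
    print(rf)       # reports the OOB error estimate and a confusion matrix
    importance(rf)  # mean decrease in Gini for each attribute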