Solved – Classification & Regression Trees (CART)

Most CART examples I encounter split the sample into a training set and a test set. This split-sample method strikes me as terribly inefficient unless you have a large sample. Are there alternatives to it if you want to use CART? Or is one better off using logistic regression followed by bootstrapping to assess model over-optimism (as discussed by Harrell)?
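
For concreteness, here is a minimal sketch of one such alternative: k-fold cross-validation of a single CART, so every observation is used for both fitting and validation. The rpart package and the built-in iris data are used purely for illustration.

    library(rpart)

    set.seed(42)
    k <- 10
    folds <- sample(rep(1:k, length.out = nrow(iris)))  # random fold assignment

    cv_error <- sapply(1:k, function(i) {
      fit  <- rpart(Species ~ ., data = iris[folds != i, ])     # fit on k-1 folds
      pred <- predict(fit, iris[folds == i, ], type = "class")  # predict held-out fold
      mean(pred != iris$Species[folds == i])                    # fold error rate
    })

    mean(cv_error)  # cross-validated estimate of the misclassification rate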

Best Answer

You are likely looking for Random Forests. RFs build many, many trees (that's the "forest" part), each grown on a bootstrap sample of the data (bagging) and considering only a random subset of the attributes at each split (that's the "random" part).

This reduces the overfitting that single CARTs are prone to; in fact, RFs handle it so well that they typically require no pruning at all (which single CARTs do). Since each split considers only a small, randomly chosen subset of the available attributes (e.g., five of them), the trees are also very fast to grow. Finally, the out-of-bag (OOB) error gives you an honest estimate of the likely prediction error without setting aside a separate test sample.
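
To make the mechanism concrete, here is a toy sketch assuming the rpart package and the iris data. Note that a real RF re-draws the attribute subset at every split, whereas this simplified version draws it once per tree (closer to Ho's random subspace method); the tree count of 100 and m = 2 are arbitrary illustrative choices.

    library(rpart)

    set.seed(1)
    grow_one_tree <- function(data, response, m) {
      rows  <- sample(nrow(data), replace = TRUE)         # bootstrap the rows (bagging)
      preds <- sample(setdiff(names(data), response), m)  # random attribute subset
      rpart(reformulate(preds, response), data = data[rows, ])
    }

    forest <- replicate(100, grow_one_tree(iris, "Species", m = 2), simplify = FALSE)

    # Predict a new case by majority vote over all trees in the forest
    votes <- sapply(forest, function(tree)
      as.character(predict(tree, newdata = iris[1, ], type = "class")))
    names(which.max(table(votes)))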

For R, there is the appropriately named randomForest package on CRAN. You may also want to look through questions tagged [random-forest].
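
A minimal example of fitting one, again using the iris data purely for illustration; the OOB error it reports addresses the original question about avoiding a separate test sample.

    library(randomForest)

    set.seed(1)
    rf <- randomForest(Species ~ ., data = iris, ntree = 500)
    print(rf)       # reports the OOB error estimate and a confusion matrix
    importance(rf)  # mean decrease in Gini for each attribute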