Solved – How are CP (Cost Complexity) values calculated in RPART (or decision trees in general)

cart, r, rpart

From what I understand, the cp argument to the rpart function helps pre-prune the tree in the same way as the minsplit or minbucket arguments. What I don't understand is how the CP values are computed. For example:

library(rpart)
# y is a factor, so rpart fits a classification tree (method = "class") by default
df <- data.frame(x = c(1, 2, 3, 3, 3, 4), y = as.factor(c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)))
mytree <- rpart(y ~ x, data = df, minbucket = 1, minsplit = 1)

Resulting tree…

mytree
n= 6 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 6 3 FALSE (0.5000000 0.5000000)  
  2) x>=2.5 4 1 FALSE (0.7500000 0.2500000) *
  3) x< 2.5 2 0 TRUE (0.0000000 1.0000000) *

Summary…

summary(mytree)

Call:
rpart(formula = y ~ x, data = df, minbucket = 1, minsplit = 1)
  n= 6 

         CP nsplit rel error    xerror      xstd
1 0.6666667      0 1.0000000 2.0000000 0.0000000
2 0.0100000      1 0.3333333 0.6666667 0.3849002

Where are the 0.666 and the 0.01 coming from?

Best Answer

I searched for the same thing for many days, and what I found is that the package calculates the CP values itself. If you do not specify cp, rpart uses its default of 0.01. The cp value is the cost of adding a node to the tree.
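The 0.6666667 in the first row can be reproduced by hand. The sketch below assumes the usual cost-complexity convention behind rpart's CP table: errors are scaled so that the root has a relative error of 1, and the CP in a row is the drop in relative error per added split when moving to the next row. The loss counts are read off the printed tree in the question; the variable names are only illustrative.

```r
# Misclassified observations, read from the printed tree in the question.
root_loss  <- 3        # node 1: 3 of 6 misclassified
split_loss <- 1 + 0    # nodes 2 and 3 after the single split

# Errors are scaled by the root error, so the root has rel error 1.
rel_error <- c(root_loss, split_loss) / root_loss
nsplit    <- c(0, 1)

# CP for row 1: drop in rel error per extra split when moving to row 2.
cp_row1 <- -diff(rel_error) / diff(nsplit)

print(rel_error)  # 1.0000000 0.3333333
print(cp_row1)    # 0.6666667
```

The 0.01 in the last row matches the default cp mentioned above, which suggests it is the stopping threshold rather than a value computed from the data.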

Related Solutions

Solved – Interpreting rpart output for decision trees

From the rpart manual (http://cran.r-project.org/web/packages/rpart/rpart.pdf), plotcp() details:

"The set of possible cost-complexity prunings of a tree from a nested set. For the geometric means of the intervals of values of cp for which a pruning is optimal, a cross-validation has (usually) been done in the initial construction by rpart. The cptable in the fit contains the mean and standard deviation of the errors in the cross-validated prediction against each of the geometric means, and these are plotted by this function. A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line."

PS: library(rpart)
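The rule quoted above can be applied to the cptable directly. This is a minimal sketch, assuming the horizontal line is the minimum cross-validated error plus one standard error (which is how plotcp() documents it), and using the kyphosis data that ships with rpart. The names threshold, i_best and best_cp are illustrative, and pruning at the CP of the chosen table row is a common convention rather than something the manual prescribes.

```r
library(rpart)

set.seed(1)  # cross-validation folds are random, so fix the seed
# kyphosis ships with rpart; a small cp grows a larger tree to prune back
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", cp = 0.001)

cpt   <- fit$cptable
i_min <- which.min(cpt[, "xerror"])

# Horizontal line: minimum xerror plus one standard error
threshold <- cpt[i_min, "xerror"] + cpt[i_min, "xstd"]

# Leftmost (simplest) row whose cross-validated error is at or below the line
i_best  <- which(cpt[, "xerror"] <= threshold)[1]
best_cp <- cpt[i_best, "CP"]

pruned <- prune(fit, cp = best_cp)
print(best_cp)
```

Calling plotcp(fit) on the same object draws the curve and the line, so the chosen row can be checked by eye.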

Solved – How do Conditional Inference Trees do binary classification

By default ctree() uses a certain generalized type of association test to determine the split variable to be used. You can easily replicate this with the coin package that implements the same class of association tests for direct use:

library("coin")
independence_test(t20.a ~ p20, teststat = "quad")
##         Asymptotic General Independence Test
## 
## data:  t20.a by p20
## chi-squared = 14.286, df = 1, p-value = 0.0001571

Once the variable is selected, the optimal split point is selected by maximizing the corresponding two-sample association test. More details and further references are given in the discussion "What is the test statistics used for a conditional inference regression tree?"
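The split-point step described above can be illustrated in base R. This is only a sketch of the idea, not ctree()'s actual code: it uses the Pearson chi-squared statistic from chisq.test() as a stand-in for the two-sample association statistic, and the toy data and the helper split_stat are hypothetical.

```r
# Toy data (hypothetical): the response switches class between x = 10 and x = 11.
x <- 1:20
y <- factor(rep(c("a", "b"), each = 10))

# Candidate cut points: every observed value except the largest.
cuts <- head(sort(unique(x)), -1)

# Pearson chi-squared statistic for the 2x2 table (x <= cut) by y.
# Stand-in for the statistic ctree() maximizes, not its implementation.
split_stat <- function(cut) {
  tab <- table(x <= cut, y)
  # Small expected counts trigger an approximation warning; only the
  # statistic itself is needed here, so the warning is suppressed.
  unname(suppressWarnings(chisq.test(tab, correct = FALSE)$statistic))
}

stats    <- vapply(cuts, split_stat, numeric(1))
best_cut <- cuts[which.max(stats)]
print(best_cut)  # 10
```

On perfectly separated data like this, the statistic peaks at the class boundary, so the chosen split is x <= 10.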

Optionally, the maximally selected statistic can also be used for choosing the split variable. By default it is not used, because it is somewhat less powerful for monotonic associations and computationally more intensive. However, as your pat363 example illustrates, the default can miss certain completely balanced patterns (like the XOR or chessboard example). That said, any splitting strategy you use will perform better on some patterns and worse on others...
