Rpart – Understanding and Resolving Complexity Parameter Confusion in CART

cart, r, rpart

I'm a little confused about how CP is calculated in the summary of an rpart object.

Take this example:

library(rpart)

# y is a factor, so rpart() fits a classification tree (method = "class") by default
df <- data.frame(x = c(1, 2, 3, 3, 3),
                 y = factor(c("a", "a", "b", "a", "b")))
mytree <- rpart(y ~ x, data = df, minbucket = 1, minsplit = 1)
summary(mytree)

Call:
rpart(formula = y ~ x, data = df, minbucket = 1, minsplit = 1)
  n= 5 

    CP nsplit rel error xerror      xstd
1 0.50      0       1.0      1 0.5477226
2 0.01      1       0.5      2 0.4472136

Variable importance
  x 
100 

Node number 1: 5 observations,    complexity param=0.5
  predicted class=a  expected loss=0.4  P(node) =1
    class counts:     3     2
   probabilities: 0.600 0.400 
  left son=2 (2 obs) right son=3 (3 obs)
  Primary splits:
      x < 2.5 to the left,  improve=1.066667, (0 missing)

For the root node, I would've thought the CP should be 0.4 since the probability of misclassifying an element in the root is 0.4 and the tree size at the root is 0. How is 0.5 the correct CP?

Best Answer

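As far as I know, the complexity parameter is not the error in that particular node. It is the amount by which splitting that node improves the relative error, where relative error is the tree's misclassification rate divided by the root's misclassification rate. In your example the root misclassifies 2 of 5 observations (0.4, which is relative error 1.0 by definition). After the split on x < 2.5 only 1 of 5 is misclassified (0.2, which is relative error 0.5). Splitting the root therefore dropped the relative error from 1.0 to 0.5 with a single split, so the CP of the root node is 0.5.

The 0.01 in the second row is not a measured improvement. As far as I can tell, it is the default value of the cp argument, the threshold an improvement must reach for a split to be kept, and the last row of the table simply reports that threshold. Tree building stopped there because no further split is possible: the left node (x = 1, 2) is already pure, and the three observations in the right node all have x = 3, so there is nothing left to separate them on.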
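
To check the arithmetic by hand, here is a minimal base-R sketch that recomputes the root CP for the toy data without calling rpart. The helper `node_errors` and the variable names are made up for illustration, and the split point 2.5 is copied from the summary output rather than searched for.

```r
# Toy data from the question
x <- c(1, 2, 3, 3, 3)
y <- factor(c("a", "a", "b", "a", "b"))

# Hypothetical helper: number of observations a node misclassifies
# when it predicts its majority class
node_errors <- function(labels) length(labels) - max(table(labels))

n <- length(y)
root_err <- node_errors(y) / n                              # 2/5 = 0.4

# Error after the single split x < 2.5 reported in the summary
left  <- y[x < 2.5]
right <- y[x >= 2.5]
split_err <- (node_errors(left) + node_errors(right)) / n   # (0 + 1)/5 = 0.2

# Errors are expressed relative to the root error
rel_err_root  <- root_err / root_err                        # 1.0
rel_err_split <- split_err / root_err                       # 0.5

# CP of the root: drop in relative error per added split (one split here)
cp_root <- (rel_err_root - rel_err_split) / 1

print(c(root_err = root_err, split_err = split_err, cp_root = cp_root))
```

The values should line up with the `rel error` column and the first `CP` entry in the table above; with rpart loaded, `mytree$cptable` gives the same table for comparison.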