I'm a little confused about how CP is calculated in the summary of an rpart
object.
Take this example:
library(rpart)
df <- data.frame(x = c(1, 2, 3, 3, 3),
                 y = factor(c("a", "a", "b", "a", "b")))
# y is a factor, so rpart defaults to method = "class"
mytree <- rpart(y ~ x, data = df, minbucket = 1, minsplit = 1)
summary(mytree)
Call:
rpart(formula = y ~ x, data = df, minbucket = 1, minsplit = 1)
n= 5
    CP nsplit rel error xerror      xstd
1 0.50      0       1.0      1 0.5477226
2 0.01      1       0.5      2 0.4472136
Variable importance
x
100
Node number 1: 5 observations, complexity param=0.5
predicted class=a expected loss=0.4 P(node) =1
class counts: 3 2
probabilities: 0.600 0.400
left son=2 (2 obs) right son=3 (3 obs)
Primary splits:
x < 2.5 to the left, improve=1.066667, (0 missing)
For the root node, I would've thought the CP should be 0.4 since the probability of misclassifying an element in the root is 0.4 and the tree size at the root is 0. How is 0.5 the correct CP?
Best Answer
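As far as I know, the complexity parameter is not the error in that particular node. It is the amount by which splitting that node improved the relative error, per split added. The relative error is scaled so that the root has a value of 1.0: the root misclassifies 2 of 5 observations (your 0.4), and after the split on x < 2.5 only 1 of 5 is misclassified, so the relative error becomes 0.2 / 0.4 = 0.5. Splitting the root therefore dropped the relative error from 1.0 to 0.5 with one split, so the CP of the root node is 0.5. The 0.01 in the second row is not a measured improvement; it is the default value of the cp argument, the threshold a split must meet to be kept. No further split reached it (the right-hand node contains only x = 3, so it cannot be split on x at all), so tree building stopped there.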
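To make the arithmetic concrete, here is a minimal sketch in base R that reproduces the 0.5 by hand from the question's data. It does not call rpart at all; the helper node_errors and the variable names are made up for illustration, and the split point x < 2.5 is simply taken from the summary output above.

```r
# Data from the question
x <- c(1, 2, 3, 3, 3)
y <- factor(c("a", "a", "b", "a", "b"))

# Hypothetical helper: number of misclassified observations when a node
# predicts its majority class
node_errors <- function(labels) length(labels) - max(table(labels))

root_err  <- node_errors(y)                                      # 2 of 5
split_err <- node_errors(y[x < 2.5]) + node_errors(y[x >= 2.5])  # 0 + 1

# Relative error: misclassifications scaled by the root's misclassifications
rel_root  <- root_err / root_err    # 1.0 by construction
rel_split <- split_err / root_err   # 0.5

# Drop in relative error per added split (one split was added here)
cp_root <- (rel_root - rel_split) / 1
print(c(rel_root = rel_root, rel_split = rel_split, cp_root = cp_root))
```

The values of rel_root and rel_split should line up with the "rel error" column of the table in the question, and cp_root with the CP in its first row.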