Solved – Is it possible to have xerror increased in a tree using rpart

Tags: cart, cross-validation, r, rpart

I am new to R and the rpart package. When I fit a tree using rpart:

> temp_control <- rpart.control(xval=10, minbucket=2, minsplit=4, cp=0.0001)
> dfit <- rpart(Target~., data = temp_data, method = 'class', control=temp_control)
> printcp(dfit)

Then I get:

          CP nsplit rel error xerror     xstd
1 0.00189329      0   1.00000 1.0000 0.040140
2 0.00172117     28   0.92255 1.0861 0.041708
3 0.00114745     32   0.91566 1.0947 0.041861
4 0.00098353     41   0.90534 1.1102 0.042133
5 0.00086059     48   0.89845 1.1274 0.042433
6 0.00043029     62   0.88640 1.1515 0.042849
7 0.00034423     75   0.87952 1.1635 0.043055
8 0.00028686     80   0.87780 1.1687 0.043142
9 0.00010000     89   0.87263 1.1807 0.043346

Why does xerror increase as the tree grows? Do I need to adjust the parameters further? Also, I am wondering how the root node error is calculated. Does it depend only on the dataset, or is it also affected by the parameter settings?

I also tried the "anova" method, even though my response variable is categorical (Y/N). I recoded it as 0/1 and ran "anova", and got:

            CP nsplit rel error  xerror     xstd
1   3.1473e-02      0   1.00000 1.00025 0.037408
2   1.1506e-02      1   0.96853 0.97164 0.035479
3   5.6396e-03      2   0.95702 0.96528 0.035172
4   4.6137e-03      3   0.95138 0.96970 0.035029
5   4.4412e-03      6   0.93754 0.97246 0.035019
6   4.3751e-03      7   0.93310 0.97006 0.034915
7   4.1352e-03     10   0.91997 0.97109 0.034912
8   3.5702e-03     11   0.91584 0.97316 0.034847
9   3.0148e-03     14   0.90513 0.96819 0.034671
10  2.5334e-03     15   0.90211 0.96872 0.034725
11  2.2789e-03     16   0.89958 0.96959 0.034753
12  2.2342e-03     17   0.89730 0.97437 0.034829
13  1.8732e-03     18   0.89507 0.98647 0.035104
14  1.8401e-03     19   0.89319 0.99511 0.035199

Does anyone have any idea about this?

Best Answer

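xerror is the cross-validation error. Why does it increase? Because of overfitting, and detecting that is exactly what cross-validation is for. The rel error column is computed on the training data, so it keeps decreasing as splits are added, while xerror is estimated on held-out folds. In your case the result makes perfect sense: more splits in an rpart tree mean a more complex model, which has more opportunity to overfit.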
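To make the two error columns and the root node error concrete, here is a minimal sketch. It assumes the kyphosis data set that ships with rpart as a stand-in for temp_data, and reuses the control settings from the question. With the default priors and loss matrix, the root node error is simply the misclassification rate of always predicting the majority class, so it depends on the data and not on the rpart.control settings.

```r
library(rpart)

# Assumption: kyphosis (shipped with rpart) stands in for the asker's temp_data.
set.seed(42)  # cross-validation folds are random, so xerror varies between runs
ctrl <- rpart.control(xval = 10, minbucket = 2, minsplit = 4, cp = 0.0001)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = ctrl)

cp_tab <- fit$cptable  # the same matrix that printcp() prints

# Training error relative to the root: never increases as splits are added.
rel_error <- cp_tab[, "rel error"]

# Cross-validated error relative to the root: free to rise once splits fit noise.
xerror <- cp_tab[, "xerror"]

# Root node error: misclassified cases at the root divided by the sample size,
# i.e. the error rate of always predicting the majority class.
root_error <- fit$frame$dev[1] / fit$frame$n[1]

# Convert the relative columns to absolute error rates.
abs_error <- cbind(nsplit = cp_tab[, "nsplit"],
                   train  = rel_error * root_error,
                   cv     = xerror * root_error)
print(abs_error)
```

An xerror above 1 at some row therefore means that the tree of that size is estimated to do worse on unseen data than the root-only tree.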

Try the plotcp function to see where overfitting sets in and to select the "right" tree size.
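One possible way to act on this, sketched on the kyphosis data set that ships with rpart (an assumption, since temp_data is not available): plot the cross-validated error against cp, then prune back to the row with the lowest xerror. Choosing the minimum is only one convention; a common alternative is to take the smallest tree whose xerror lies within one xstd of the minimum.

```r
library(rpart)

# Assumption: kyphosis stands in for the asker's data; the formula is illustrative.
set.seed(1)  # fix the cross-validation folds so the result is repeatable
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(xval = 10, minbucket = 2,
                                     minsplit = 4, cp = 0.0001))

# Cross-validated relative error versus cp / tree size.
plotcp(fit)

# Pick the cp value of the row with the smallest cross-validated error.
cp_tab  <- fit$cptable
best    <- which.min(cp_tab[, "xerror"])
best_cp <- cp_tab[best, "CP"]

# Prune the over-grown tree back to that complexity.
pruned <- prune(fit, cp = best_cp)
printcp(pruned)
```

If the lowest xerror sits at nsplit = 0, as in the first table of the question, the pruned tree is just the root node, which suggests the predictors carry little usable signal for that response.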