Solved – How to choose the number of splits in rpart()

cart, r, rpart

I set minsplit = 2 via rpart.control() and got the following results from rpart(). To avoid overfitting the data, should I use 3 splits or 7 splits? Shouldn't it be 7? Please let me know.

Variables actually used in tree construction:

[1] ct_a ct_b usr_a

Root node error: 23205/60 = 386.75

n= 60        

    CP nsplit rel error  xerror     xstd
1 0.615208      0  1.000000 1.05013 0.189409
2 0.181446      1  0.384792 0.54650 0.084423
3 0.044878      2  0.203346 0.31439 0.063681
4 0.027653      3  0.158468 0.27281 0.060605
5 0.025035      4  0.130815 0.30120 0.058992
6 0.022685      5  0.105780 0.29649 0.059138
7 0.013603      6  0.083095 0.21761 0.045295
8 0.010607      7  0.069492 0.21076 0.042196
9 0.010000      8  0.058885 0.21076 0.042196

Best Answer

The convention is to use either the best tree (the one with the lowest cross-validated relative error, xerror) or the smallest (simplest) tree whose xerror is within one standard error of the best tree's. Here the best tree is in row 8 (7 splits); row 9 has the same xerror but one more split. The tree in row 7 (6 splits) does effectively the same job: its xerror of 0.21761 is below the best tree's xerror plus one standard error (xstd), 0.21076 + 0.042196 = 0.252956. It is also simpler, so the one-standard-error rule would select it.
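The selection described above can be reproduced directly from the printed table. The following is a minimal sketch in base R; the data frame cp_table is a hand-copied version of the question's output, and the names cp_table, best, threshold and one_se are made up for illustration.

```r
# CP table copied by hand from the printcp() output in the question
cp_table <- data.frame(
  CP     = c(0.615208, 0.181446, 0.044878, 0.027653, 0.025035,
             0.022685, 0.013603, 0.010607, 0.010000),
  nsplit = 0:8,
  xerror = c(1.05013, 0.54650, 0.31439, 0.27281, 0.30120,
             0.29649, 0.21761, 0.21076, 0.21076),
  xstd   = c(0.189409, 0.084423, 0.063681, 0.060605, 0.058992,
             0.059138, 0.045295, 0.042196, 0.042196)
)

# Best tree: lowest cross-validated error (which.min keeps the first,
# i.e. simplest, row when there is a tie)
best <- which.min(cp_table$xerror)

# One-standard-error threshold: best xerror plus its xstd
threshold <- cp_table$xerror[best] + cp_table$xstd[best]

# Simplest tree whose xerror is at or below the threshold
one_se <- min(which(cp_table$xerror <= threshold))

cp_table[c(best, one_se), ]

# With a fitted model (hypothetical object `fit`), the same indices could be
# applied to fit$cptable, and the tree pruned with something like
#   prune(fit, cp = fit$cptable[one_se, "CP"])
```

Working from fit$cptable rather than the printed values avoids the rounding in the printout.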

Related Solutions

R – How to Use Recursive Partitioning with rpart() Method in R

Perhaps you misunderstood the message? It is saying that, having built the tree using the control parameters specified, only the variables mpa_a and tc_b have been involved in splits. All the variables were considered, but just these two were needed.

That tree seems quite small; do you have only a small sample of observations? If you want to grow a bigger tree for subsequent pruning back, then you need to alter the minsplit and minbucket control parameters. See ?rpart.control, e.g.:

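library(rpart)

# minsplit = 2 and minbucket = 1 let the tree keep splitting down to single observations
rm <- rpart(uloss ~ tc_b + ublkb + mpa_a + mpa_b + 
            sys_a + sys_b + usr_a, data = data81, method = "anova",
            control = rpart.control(minsplit = 2, minbucket = 1))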

This would try to fit a full tree, but that tree will be hopelessly overfitted to the data, so you must prune it back using prune(). It might, however, reassure you that rpart() used all the data.
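As a rough illustration of the grow-then-prune workflow, here is a sketch on the built-in mtcars data, used only as a stand-in because data81 is not available. The choice of cp = 0, the seed, and the object names full, cpt, best_cp and pruned are assumptions made for this example, not part of the original answer.

```r
library(rpart)

set.seed(1)  # cross-validation folds are random; fix them for repeatability

# Deliberately overgrown tree on a stand-in data set (mtcars, not data81).
# cp = 0 is an assumption added here so that no split is rejected for
# improving the fit too little.
full <- rpart(mpg ~ ., data = mtcars, method = "anova",
              control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

# Take the CP value of the row with the lowest cross-validated error
cpt <- full$cptable
best_cp <- cpt[which.min(cpt[, "xerror"]), "CP"]

# Prune the overgrown tree back to that complexity
pruned <- prune(full, cp = best_cp)

# Number of nodes before and after pruning
c(full = nrow(full$frame), pruned = nrow(pruned$frame))
```

The minimum-xerror row is used here for brevity; the one-standard-error rule could be substituted when choosing the CP value.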

Solved – Is it possible to have xerror increased in a tree using rpart

xerror is the cross-validation error. Why would the validation error increase? Because of overfitting, which is exactly what cross-validation is meant to detect. In your case it makes perfect sense: more splits in an rpart tree mean a more complex model, and a more complex model has more opportunity to overfit.

Try the plotcp() function to see where overfitting sets in and to select the "right" tree size.
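A small sketch of that check, again on the stand-in mtcars data rather than the asker's data; the seed and the names fit, xerr and went_up are illustrative assumptions. It tests whether xerror ever rises from one row of the CP table to the next and then draws the plotcp() chart.

```r
library(rpart)

set.seed(42)  # cross-validation is random, so results vary with the seed

# Permissive settings so the tree is allowed to grow large
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(minsplit = 2, minbucket = 1))

# Cross-validated error for each successive tree size
xerr <- fit$cptable[, "xerror"]

# TRUE if xerror goes up anywhere as more splits are added
went_up <- any(diff(xerr) > 0)
went_up

# xerror against tree size; the dotted horizontal line sits one standard
# error above the minimum of the curve
plotcp(fit, upper = "splits")
```

Whether went_up is TRUE depends on the data and the random folds, so no particular outcome is claimed here.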
