R – How to Use Recursive Partitioning with rpart() Method in R

I am new to R and am using rpart to build a regression tree for my data. I wanted to use all the input variables to build the tree, but rpart used only a couple of them, as shown below. As we can see, I have provided 10 inputs, but rpart used only two. Please let me know how I can force rpart to use all the input variables. Thanks.

rm = rpart(uloss ~ tc_b + ublkb + mpa_a + mpa_b + 
     sys_a + sys_b + usr_a, data = data81, method="anova")
> printcp(rm)

Regression tree:
rpart(formula = uloss ~ tc_b + ublkb + mpa_a + mpa_b + sys_a + 
    sys_b, data = data81, weights = usr_a, method = "anova")

Variables actually used in tree construction:
[1] mpa_a tc_b

Root node error: 647924/81 = 7999

n= 81

       CP nsplit rel error  xerror     xstd
1 0.403169      0   1.00000 1.04470 0.025262
2 0.092390      1   0.59683 0.66102 0.015238
3 0.081084      2   0.50444 0.70702 0.013123
4 0.045304      3   0.42336 0.58683 0.012129
5 0.010000      4   0.37805 0.51930 0.011942

One more question:

I used rpart.control with minsplit=2 and got the following for another data set.
In order to avoid overfitting the data, should I use the tree with 3 splits or the one with 7 splits? Shouldn't I use 7 splits? Please let me know.

Variables actually used in tree construction:
[1] ct_a ct_b usr_a

Root node error: 23205/60 = 386.75

n= 60

        CP nsplit rel error  xerror     xstd
1 0.615208      0  1.000000 1.05013 0.189409
2 0.181446      1  0.384792 0.54650 0.084423
3 0.044878      2  0.203346 0.31439 0.063681
4 0.027653      3  0.158468 0.27281 0.060605
5 0.025035      4  0.130815 0.30120 0.058992
6 0.022685      5  0.105780 0.29649 0.059138
7 0.013603      6  0.083095 0.21761 0.045295
8 0.010607      7  0.069492 0.21076 0.042196
9 0.010000      8  0.058885 0.21076 0.042196

Best Answer

Perhaps you misunderstood the message? It is saying that, having built the tree using the control parameters specified, only the variables mpa_a and tc_b have been involved in splits. All the variables were considered, but just these two were needed.
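To see the difference between the variables offered and the variables used, here is a minimal sketch. It uses the built-in mtcars data as a stand-in, since data81 is not available; the names fit, offered, used and unused are only illustrative. The variable importance vector is printed as well because, as far as I understand rpart, it also credits variables that served only as surrogate splits, which is one sign that they were evaluated even though they do not appear in the printed tree.

```r
library(rpart)  # recommended package, shipped with standard R installations

# Stand-in for data81 (assumption): predict mpg from every other mtcars column.
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

# All predictors that were offered to rpart() through the formula.
offered <- setdiff(names(mtcars), "mpg")

# Variables that actually appear in a primary split; leaves are marked "<leaf>".
used <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")

# Predictors that were considered but never chosen for a primary split.
unused <- setdiff(offered, used)

print(used)
print(unused)

# Importance scores; these can include variables absent from the primary splits.
print(fit$variable.importance)
```

The used vector should match what printcp(fit) reports under "Variables actually used in tree construction".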

That tree seems quite small; do you have only a small sample of observations? If you want to grow a bigger tree for subsequent pruning back, then you need to alter the minsplit and minbucket control parameters. See ?rpart.control, e.g.:

rm <- rpart(uloss ~ tc_b + ublkb + mpa_a + mpa_b + 
            sys_a + sys_b + usr_a, data = data81, method = "anova",
            control = rpart.control(minsplit = 2, minbucket = 1))

would try to fit a full tree, but it will be hopelessly over-fitted to the data and you must prune it back using prune(). However, that might assure you that rpart() used all the data.
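As a rough sketch of the grow-then-prune workflow, again on the built-in mtcars data rather than data81: the tree is grown with very loose controls and then cut back at the complexity parameter with the lowest cross-validated error (the xerror column of the CP table). Setting cp = 0 and choosing the minimum-xerror row are my assumptions for the illustration, not something taken from the question; a one-standard-error rule is a common alternative.

```r
library(rpart)

set.seed(42)  # the xerror column comes from random cross-validation folds

# Grow a deliberately over-fitted tree. cp = 0 is an assumption added here so
# that growth is not stopped early by the default cp = 0.01.
full <- rpart(mpg ~ ., data = mtcars, method = "anova",
              control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

# The CP table is the same matrix that printcp() displays.
cp_tab <- full$cptable

# Row with the smallest cross-validated error (which.min takes the first on ties).
best <- which.min(cp_tab[, "xerror"])

# Prune the full tree back to that complexity parameter.
pruned <- prune(full, cp = cp_tab[best, "CP"])

printcp(pruned)
```

Because the folds are random, the chosen row can change from run to run on small data sets, so it is worth repeating with a few seeds before settling on a number of splits.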