Solved – Control parameter minsplit for rpart in regression tree

rpart

I ran rpart() twice on the same dataset, once without the control parameter "minsplit" and once with it. I do not understand why the first split differs between the two runs. My understanding is that "minsplit" only extends the branches of the tree and should not change the structure determined by the significant predictors. Please correct me if my understanding is wrong, and explain why the two results differ. Thanks!

1st method – without "minsplit":

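library(rpart)
# Default controls: minsplit = 20
rt <- rpart(SO2 ~ Temp + Manuf + Pop + Wind + Precip + Days, data = usair, method = "anova")
par(xpd = NA)  # allow labels to be drawn outside the plot region
plot(rt)
text(rt, use.n = TRUE, all = TRUE)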

2nd method – with "minsplit":

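# Same model, but a node needs only 10 observations before a split is attempted
rt <- rpart(SO2 ~ Temp + Manuf + Pop + Wind + Precip + Days, minsplit = 10, data = usair, method = "anova")
par(xpd = NA)
plot(rt)
text(rt, use.n = TRUE, all = TRUE)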


Best Answer

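By default, 'minsplit' is 20, meaning a node must contain at least 20 observations before a split is attempted. Unless you set it yourself, the minimal number of observations per leaf ('minbucket') is derived from it as a third of 'minsplit', rounded (see the R help for rpart.control). So in the first plot, since the minimal leaf size is $20/3 \approx 7$, the five very large 'Manuf' values are not allowed to be separated from the smaller values. With 'minsplit=10' the minimal leaf size drops to $10/3 \approx 3$, so a leaf holding only those five observations becomes admissible, and the first split changes. Setting 'minbucket' to 3 in both versions should show the expected similarities.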
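
The leaf-size rule can be checked directly. The following is only a minimal sketch: the data frame 'toy' and the helper 'leaf_sizes' are hypothetical names made up for illustration and are not the usair data from the question. It reads the derived 'minbucket' out of rpart.control() and then fits the same model under both settings, so the leaf sizes can be compared.

```r
library(rpart)  # ships with R as a recommended package

# Default leaf-size limit implied by minsplit: minbucket = round(minsplit / 3)
default_ctrl <- rpart.control()               # minsplit = 20
small_ctrl   <- rpart.control(minsplit = 10)
c(default_ctrl$minbucket, small_ctrl$minbucket)

# Hypothetical toy data (NOT usair): 36 ordinary rows plus 5 rows with a
# very large predictor value and a much higher response.
toy <- data.frame(
  x = c(1:36, 1001:1005),
  y = c(rep(10, 36), rep(100, 5))
)

fit_default <- rpart(y ~ x, data = toy, method = "anova")
fit_small   <- rpart(y ~ x, data = toy, method = "anova", minsplit = 10)

# Number of observations in each terminal node of a fitted tree
leaf_sizes <- function(fit) fit$frame$n[fit$frame$var == "<leaf>"]

leaf_sizes(fit_default)  # no leaf may be smaller than the default minbucket
leaf_sizes(fit_small)    # smaller leaves are admissible here
```

With the default controls the five extreme rows cannot sit in a leaf of their own, so the first split has to fall somewhere else. That is the same mechanism that changes the first node in the question.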

Note that 'significant' has a quite different meaning in statistics than 'important'.