Solved – How to find the best size of a decision tree by stratified k-fold cross validation using R

Tags: cart, cross-validation, r

I am a novice trying to understand things better.

As I understand it, when one uses stratified k-fold cross-validation to decide the size of a decision tree, the procedure is to:

1) randomly divide the data into k equally sized folds, each preserving the overall class proportions

2) combine all folds except one into a training set, keeping the held-out fold as a test set

3) build a sequence of trees of different sizes on the training set

4) evaluate each tree on the test set

5) record the misclassification rate for each tree size

6) repeat steps 2-5 so that each fold is held out exactly once, giving k train/test combinations in total

7) sum (or average) the misclassification rates across folds for each tree size

8) plot the cross-validation error rate as a function of tree size

9) select the size according to the 1-SE rule

If I am correct, this method is especially useful when the class distribution is highly uneven, since plain random sampling could produce folds in which one class dominates or, in the extreme case, is the only class present. The results of such an effort would therefore be unreliable.
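For concreteness, a minimal R sketch of steps 1-9 might look like the following. It assumes a data frame called mydata with a factor outcome Class (both names are placeholders), uses caret::createFolds for the stratified folds, and indexes the candidate trees by a shared cp grid as a stand-in for tree size rather than by an exact leaf count.

    ## Sketch only: mydata, Class and the cp values are illustrative assumptions.
    library(caret)   # createFolds() stratifies on a factor outcome
    library(rpart)

    set.seed(42)
    k <- 10
    folds   <- createFolds(mydata$Class, k = k)        # step 1: stratified folds
    cp_grid <- c(0.05, 0.02, 0.01, 0.005, 0.001)       # proxy for "tree size"

    err <- matrix(NA, nrow = k, ncol = length(cp_grid))
    for (i in seq_len(k)) {
      test  <- mydata[folds[[i]], ]                    # step 2: hold one fold out
      train <- mydata[-folds[[i]], ]
      full  <- rpart(Class ~ ., data = train, method = "class",
                     control = rpart.control(cp = 0, xval = 0))
      for (j in seq_along(cp_grid)) {                  # steps 3-5: trees of several sizes
        pruned    <- prune(full, cp = cp_grid[j])
        pred      <- predict(pruned, newdata = test, type = "class")
        err[i, j] <- mean(pred != test$Class)          # misclassification rate
      }
    }
    colMeans(err)                                      # steps 7-8: error per candidate size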

However, I can't find a function in R that uses this method to identify the best size of a tree.

So I attempted to do stratified 10-fold cross-validation manually with the caret and rpart packages, but ran into a problem. I tried to produce trees of the desired sizes by varying the cp control parameter. However, the same cp value produces differently sized trees on each training set, so I cannot accumulate cross-validation error rates for the specific tree sizes I want.
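(One possible workaround, not from the original post: since each fold's tree has a cptable mapping cp values to numbers of splits, one can look up the cp that gives at most the target number of splits and prune to it, so every fold targets roughly the same size. The helper below, prune_to_size, is hypothetical.)

    library(rpart)

    ## Hypothetical helper: prune a tree grown with cp = 0 down to at most
    ## `target_splits` splits, using the cp thresholds stored in its cptable.
    prune_to_size <- function(full_tree, target_splits) {
      tab <- full_tree$cptable                  # columns: CP, nsplit, rel error, ...
      row <- max(which(tab[, "nsplit"] <= target_splits))
      prune(full_tree, cp = tab[row, "CP"])
    }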

What am I doing wrong? Is there perhaps a simple function that solves my problem?

Best Answer

You control the size of a tree with the interaction.depth and n.minobsinnode parameters when training a gbm model with caret; I assume there are corresponding parameters in other decision tree modeling packages as well. interaction.depth controls the maximum number of times one split can stack on another within a single tree, while n.minobsinnode (the minimum number of observations in a terminal node) limits how many terminal regions, and therefore how many splits, a tree can form.

To perform a grid search in caret, you first construct a grid containing every combination of parameter values you want to search, and in trainControl you specify the cross-validation folds and repeats to run for EACH parameter combination.
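A hedged sketch of that grid search; mydata and Class are placeholder names, and the grid values are arbitrary examples rather than recommendations:

    library(caret)

    grid <- expand.grid(
      n.trees           = c(100, 500),
      interaction.depth = c(1, 3, 5),    # how deep splits can stack
      shrinkage         = 0.1,
      n.minobsinnode    = c(5, 10, 20)   # minimum observations per terminal node
    )

    ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

    fit <- train(Class ~ ., data = mydata,
                 method    = "gbm",
                 trControl = ctrl,
                 tuneGrid  = grid,
                 verbose   = FALSE)
    fit$bestTune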

So I'm not sure what you mean by the "cp control parameter". If your concern is an unbalanced dataset, that's a separate issue.