Solved – Optimal parameters with resampling in random forest

caret, classification, predictive-models, r, random forest

I'm building a classification model in R using random forest and the package caret. I'm interested in which parameters are optimised during resampling.

As an example, let's use the iris dataset and fit two models, one that uses no resampling and one based on 10-fold cross-validation:

library(caret)

set.seed(99)
mod1 <- train(Species ~ ., data = iris,
              method = "rf",
              ntree = 500,
              tuneGrid = data.frame(mtry = 2),
              trControl = trainControl(method = "none"))


set.seed(99)
mod2 <- train(Species ~ ., data = iris,
              method = "rf",
              ntree = 500,
              tuneGrid = data.frame(mtry = 2),
              trControl = trainControl(method = "repeatedcv", number = 10, repeats = 1))

As we can see, in both models the number of randomly selected predictors per split (mtry) is 2 and 500 trees are grown. The two models obviously give different results, but which parameters are optimised during cross-validation?
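For reference, this is how I checked what caret reports for the resampled model (using the mod2 object from above):

mod2$results   # resampled Accuracy/Kappa; the only tuning column is mtry
mod2$bestTune  # the mtry value that was selected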

As a comparison, in his presentation Kuhn talks about rpart (slides 52-69), where he explains that during resampling we are actually pruning the tree.

But what about when we're using random forest? Are the generated trees pruned as well, or are there other parameters that are optimised (e.g. maximum depth)?

Best Answer

You are not optimizing any parameters in your code. The only tuning parameter the caret package considers for random forest is mtry, which your code fixes at 2. However, it is still important to get a good estimate of the random forest's accuracy; model 2 shows it is around 95.3% using repeated k-fold cross-validation. This is similar to what we get from the out-of-bag (OOB) estimate of the random forest itself:

library(randomForest)

randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
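If you want to put the two estimates side by side, here is a minimal sketch, assuming the mod2 object from the question is still in the workspace (for classification forests, err.rate stores the cumulative OOB error after each tree):

set.seed(99)
rf_oob <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

oob_acc <- 1 - rf_oob$err.rate[rf_oob$ntree, "OOB"]  # OOB accuracy after all 500 trees
cv_acc  <- mod2$results$Accuracy                     # resampled accuracy from caret

c(OOB = oob_acc, CV = cv_acc)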

Random forest does not prune its trees. I believe the only other parameter you may want to optimize in randomForest is nodesize. It defaults to 1 for classification, but Lin and Jeon (2006) found that increasing the terminal node size can yield more accurate predictions. You'll need to tune this parameter yourself, though, since it is not a tuning parameter in the caret package (a rough way to do this by hand is sketched below). There are also other tree-based models you can consider (e.g., gradient boosting trees, extremely randomized trees). You can see a list of their tuning parameters on the GitHub page:

http://topepo.github.io/caret/train-models-by-tag.html#Random_Forest
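If you do want to experiment with nodesize yourself, here is a minimal sketch that compares a few values by their OOB error; the candidate values and the use of OOB error as the criterion are just illustrative choices:

library(randomForest)

nodesizes <- c(1, 5, 10, 20)          # arbitrary candidate values for illustration
oob_err <- sapply(nodesizes, function(ns) {
  set.seed(99)
  fit <- randomForest(Species ~ ., data = iris,
                      ntree = 500, mtry = 2, nodesize = ns)
  fit$err.rate[fit$ntree, "OOB"]      # cumulative OOB error after the last tree
})
data.frame(nodesize = nodesizes, oob_error = oob_err)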