Solved – How (not) to (over)fit a random forest in R

r, random-forest

I'm reaching out because I am unsure whether my implementation of a group of random forests in R (using the randomForest package) is valid or whether I have an error in reasoning.

I have a sales dataset with a binary outcome (1: Sale, 0: No Sale) and a set of possibly significant predictors x1-x14. My data is highly imbalanced, with ~124k '0' observations (No Sale) and ~18k '1' observations (Sale). I balance it by randomly cutting down the 124k observations to 18k, as suggested in http://bit.ly/1I7F0AC.
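That downsampling step can be sketched in base R; this is a minimal illustration with a toy data frame standing in for the real sales data (the column names `y` and `x1` are assumptions, and the row counts are scaled down):

```r
set.seed(42)  # reproducible subsample
# Toy stand-in for the imbalanced sales data described above
df <- data.frame(y  = rep(c(0, 1), times = c(1240, 180)),
                 x1 = rnorm(1420))
minority <- df[df$y == 1, ]  # the ~18k 'Sale' rows
majority <- df[df$y == 0, ]  # the ~124k 'No Sale' rows
# Randomly cut the majority class down to the minority-class size
majority_down <- majority[sample(nrow(majority), nrow(minority)), ]
balanced <- rbind(minority, majority_down)
table(balanced$y)  # both classes now have 180 rows
```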

Cross-validation is not strictly necessary for random forests, since the out-of-bag error provides a built-in performance estimate. However, in order to find a random forest with a good F-score, I loop through a set of possible predictors and a set of tree counts for the forest:

possiblyUsefulPredictors <-
  c("x1", ..., "x14")  # Shortened to pseudo-code

treerange <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50,
               60, 70, 80, 90, 100, 200, 300, 400, 500, 750, 1000)

# Create a multitude of models by looping
# through different settings for parameters
for (i in 2:length(possiblyUsefulPredictors)) {
  for (j in treerange) {

    ### Choose model here by setting data, outcome and predictors:
    x <- df[, possiblyUsefulPredictors[1:i]]  # Predictor columns
    ntree <- j                                # Number of trees
    # Tune mtry: tuneRF returns a matrix of (mtry, OOBError) pairs,
    # so pick the mtry with the lowest OOB error
    res <- tuneRF(x = x, y = y, ntreeTry = 1,
                  stepFactor = 1, improve = 0.01, trace = FALSE,
                  plot = FALSE, doBest = FALSE)
    bestMtry <- res[which.min(res[, "OOBError"]), "mtry"]
    # Run random forest (classification is inferred from a factor y)
    rf <- randomForest(x = x, y = y, mtry = bestMtry, ntree = ntree,
                       importance = TRUE)
  }
}

I then store the model diagnostics (precision, recall, and F-score) in a table and choose the model with the highest F-score (13 predictors, 90 trees, mtry=1, which yields an F-score of 78%).
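For reference, these diagnostics can be computed from the confusion counts in base R; a minimal sketch (the helper name `f_score` is my own, not from the question):

```r
# F1 = harmonic mean of precision and recall, computed from
# true positives (tp), false positives (fp) and false negatives (fn)
f_score <- function(truth, pred) {
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

truth <- c(1, 1, 1, 0, 0, 0)
pred  <- c(1, 1, 0, 1, 0, 0)
f_score(truth, pred)  # 2/3 here: precision and recall are both 2/3
```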

Specific questions:

  1. Obviously, the way I subset and loop through the predictors is highly arbitrary. Would a more sophisticated approach (e.g. looping through all possible subsets) get me anywhere, or does a random forest inherently select significant predictors, so that I wouldn't have to find a meaningful subset myself (as I would with stepwise selection in linear regression)?

  2. By building a set of 416 random forests, do I simply overfit the dataset? I am skeptical that the predictors are as good as my best model suggests.

Thank you and kind regards,
Jan

Best Answer

  1. Random forests take care of choosing feature subsets on their own; that is what the mtry parameter is for (the number of features randomly sampled as candidates at each split).
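For context, the randomForest package's default for classification is mtry = floor(sqrt(p)), where p is the number of predictors; with the 14 candidate predictors from the question:

```r
p <- 14                        # number of candidate predictors in the question
mtry_default <- floor(sqrt(p)) # randomForest's classification default
mtry_default                   # 3 features sampled at each split
```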

In tuneRF, set the ntreeTry parameter as high as your time allows, or leave it at its default; otherwise you won't get statistically sound results (with ntreeTry=1 the OOB error used for tuning is far too noisy).

  2. You don't need to worry much about overfitting with random forests; just be sure not to use the training data to evaluate model performance (see this post).
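One simple way to honor that advice is to hold out a test split before any tuning; a base-R sketch with a toy data frame (names are illustrative). For randomForest specifically, calling predict(rf) with no newdata argument also returns out-of-bag predictions, which are safe to evaluate on:

```r
set.seed(1)
# Toy stand-in for the balanced sales data
df <- data.frame(y  = factor(sample(0:1, 1000, replace = TRUE)),
                 x1 = rnorm(1000))
# Set 30% aside before any model tuning happens
test_idx  <- sample(nrow(df), 0.3 * nrow(df))
test_set  <- df[test_idx, ]
train_set <- df[-test_idx, ]
# Fit and tune on train_set only; compute precision/recall/F on test_set
c(nrow(train_set), nrow(test_set))  # 700 and 300
```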

As for your treerange parameter, I'd advise using values well over 100, as many as your machine's performance allows.

With so few features, I wouldn't bother with feature selection at all, unless you run into performance limits; in that case, try Boruta.