Solved – How (not) to (over)fit a random forest in R

r, random-forest

I'm reaching out because I am unsure whether my implementation of a group of random forests in R (using the randomForest package) is valid or whether I have an error in reasoning.

I have a sales dataset with a binary outcome (1: Sale, 0: No Sale) and a set of possibly significant predictors x1-x14. My data is highly imbalanced, with ~124k '0' observations (No Sale) and ~18k '1' observations (Sale). I balance it by randomly cutting down the 124k observations to 18k, as suggested in http://bit.ly/1I7F0AC.
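That downsampling step can be sketched in base R; this is a minimal illustration with a toy data frame standing in for the real sales data (the column names `y` and `x1` are assumptions, and the row counts are scaled down):

```r
set.seed(42)  # reproducible subsample
# Toy stand-in for the imbalanced sales data described above
df <- data.frame(y  = rep(c(0, 1), times = c(1240, 180)),
                 x1 = rnorm(1420))
minority <- df[df$y == 1, ]  # the ~18k 'Sale' rows
majority <- df[df$y == 0, ]  # the ~124k 'No Sale' rows
# Randomly cut the majority class down to the minority-class size
majority_down <- majority[sample(nrow(majority), nrow(minority)), ]
balanced <- rbind(minority, majority_down)
table(balanced$y)  # both classes now have 180 rows
```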

Cross-validation is not strictly necessary for random forests, since the out-of-bag error provides a built-in performance estimate. However, in order to find a random forest with a good F-score, I loop through a set of possible predictors and a set of tree counts for the forest:

possiblyUsefulPredictors <-
  c("x1", ..., "x14")  # Shortened to pseudo-code

treerange <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50,
               60, 70, 80, 90, 100, 200, 300, 400, 500, 750, 1000)

# Create a multitude of models by looping
# through different settings for parameters
for (i in 2:length(possiblyUsefulPredictors)) {
  for (j in treerange) {

    ### Choose model here by setting data, outcome and predictors:
    x <- df[, possiblyUsefulPredictors[1:i]]  # Predictor columns
    ntree <- j                                # Number of trees
    # Tune mtry: tuneRF returns a matrix of (mtry, OOBError) pairs,
    # so pick the mtry with the lowest OOB error
    res <- tuneRF(x = x, y = y, ntreeTry = 1,
                  stepFactor = 1, improve = 0.01, trace = FALSE,
                  plot = FALSE, doBest = FALSE)
    bestMtry <- res[which.min(res[, "OOBError"]), "mtry"]
    # Run random forest (classification is inferred from a factor y)
    rf <- randomForest(x = x, y = y, mtry = bestMtry, ntree = ntree,
                       importance = TRUE)
  }
}

I then store the model diagnostics (precision, recall, and F-score) in a table and choose the model with the highest F-score (13 predictors, 90 trees, mtry=1, which yields an F-score of 78%).
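For reference, these diagnostics can be computed from the confusion counts in base R; a minimal sketch (the helper name `f_score` is my own, not from the question):

```r
# F1 = harmonic mean of precision and recall, computed from
# true positives (tp), false positives (fp) and false negatives (fn)
f_score <- function(truth, pred) {
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

truth <- c(1, 1, 1, 0, 0, 0)
pred  <- c(1, 1, 0, 1, 0, 0)
f_score(truth, pred)  # 2/3 here: precision and recall are both 2/3
```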

Specific questions:

  1. Obviously, the way I subset and loop through the predictors is highly arbitrary. Would a more sophisticated approach (e.g. looping through all possible subsets) get me anywhere, or does a random forest inherently select significant predictors, so that I wouldn't have to find a meaningful subset myself (as I would with stepwise selection in linear regression)?

  2. By building a set of 416 random forests, do I simply overfit the dataset? I am skeptical that the predictors are as good as my best model suggests.

Thank you and kind regards,
Jan

Best Answer

  1. Random forests take care of choosing feature subsets on their own; that is what the mtry parameter is for (the number of features randomly sampled as candidates at each split).
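For context, the randomForest package's default for classification is mtry = floor(sqrt(p)), where p is the number of predictors; with the 14 candidate predictors from the question:

```r
p <- 14                        # number of candidate predictors in the question
mtry_default <- floor(sqrt(p)) # randomForest's classification default
mtry_default                   # 3 features sampled at each split
```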

In tuneRF, set the ntreeTry parameter as high as your time allows, or leave it at its default; otherwise you won't get statistically sound results (with ntreeTry=1 the OOB error used for tuning is far too noisy).

  2. You don't need to worry much about overfitting with random forests; just be sure not to use the training data to evaluate model performance (see this post).
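One simple way to honor that advice is to hold out a test split before any tuning; a base-R sketch with a toy data frame (names are illustrative). For randomForest specifically, calling predict(rf) with no newdata argument also returns out-of-bag predictions, which are safe to evaluate on:

```r
set.seed(1)
# Toy stand-in for the balanced sales data
df <- data.frame(y  = factor(sample(0:1, 1000, replace = TRUE)),
                 x1 = rnorm(1000))
# Set 30% aside before any model tuning happens
test_idx  <- sample(nrow(df), 0.3 * nrow(df))
test_set  <- df[test_idx, ]
train_set <- df[-test_idx, ]
# Fit and tune on train_set only; compute precision/recall/F on test_set
c(nrow(train_set), nrow(test_set))  # 700 and 300
```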

As for your treerange parameter, I'd advise using values well over 100, as many as your machine's performance allows.

With so few features, I wouldn't bother with feature selection at all, unless you run into performance limits; in that case, try Boruta.