I'm reaching out to you because I am unsure whether my implementation of a group of random forests in R (using library randomForest) is valid or whether I have an error in reasoning.
I have a sales dataset with a binary outcome (1: Sale, 0: No Sale) and a set of possibly significant predictors x1-x14. My data is highly imbalanced, with ~124k '0' observations (No Sale) and ~18k '1' observations (Sale). I balance it by randomly cutting down the 124k observations to 18k, as suggested in http://bit.ly/1I7F0AC.
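For concreteness, the downsampling step looks roughly like this (a toy stand-in data frame with scaled-down class sizes; the column names are placeholders for my actual data):

```r
set.seed(42)
# Toy stand-in for the real sales data (sizes scaled down, names hypothetical)
df <- data.frame(y  = factor(rep(c(0, 1), times = c(1240, 180))),
                 x1 = rnorm(1420))

idx0 <- which(df$y == 0)                 # majority class: No Sale
idx1 <- which(df$y == 1)                 # minority class: Sale
idx0_down <- sample(idx0, length(idx1))  # cut majority down to minority size
balanced <- df[c(idx0_down, idx1), ]
table(balanced$y)  # both classes now equally frequent
```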
Cross-validation should not be necessary, since random forests already provide out-of-bag error estimates. However, in order to find a random forest with a good F-score, I loop through a set of possible predictors and a range of tree counts for the forest:
library(randomForest)

possiblyUsefulPredictors <- paste0("x", 1:14)  # candidate predictors x1-x14
treerange <- c(1:10, seq(15, 50, 5), seq(60, 100, 10),
               seq(200, 500, 100), 750, 1000)

# Create a multitude of models by looping
# through different settings for parameters
for (i in 2:length(possiblyUsefulPredictors)){
  for (j in treerange){
    ### Choose model here by setting outcome and predictors:
    x <- df[, possiblyUsefulPredictors[1:i]]  # predictor columns, not just their names
    y <- as.factor(df$y)                      # factor outcome so the forest classifies
    ntree <- j                                # set number of trees
    # Tune mtry; tuneRF returns a matrix of mtry values vs. OOB error
    tuned <- tuneRF(x=x, y=y, ntreeTry=1,
                    stepFactor=1, improve=0.01, trace=FALSE,
                    plot=FALSE, doBest=FALSE)
    bestMtry <- tuned[which.min(tuned[, 2]), 1]  # mtry with lowest OOB error
    # Run random forest (classification is inferred from the factor outcome;
    # the x/y interface takes no data= or type= argument)
    rf <- randomForest(x=x, y=y, mtry=bestMtry, ntree=ntree,
                       importance=TRUE)
  }
}
I then store the model diagnostics (precision, recall, and F-score) in a table and choose the model with the highest F-score (13 predictors, 90 trees, mtry=1, which yields an F-score of 78%).
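The diagnostics themselves come straight from the confusion matrix; a minimal sketch with made-up predictions:

```r
# Toy actual/predicted labels, coded as factors with levels 0/1
actual    <- factor(c(1, 1, 1, 0, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 1, 0, 1, 0), levels = c(0, 1))

cm <- table(predicted, actual)  # rows: predicted, cols: actual
tp <- cm["1", "1"]              # true positives
fp <- cm["1", "0"]              # false positives
fn <- cm["0", "1"]              # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)
```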
Specific questions:
-
Obviously, the way I subset and loop through the predictors is arbitrary. Would a more sophisticated approach (e.g. looping through all possible subsets) get me anywhere, or does a random forest inherently select the useful predictors, so that I wouldn't have to search for a meaningful subset myself (as I would with stepwise selection in linear regression)?
-
By building a set of 416 random forests and picking the best one, am I simply overfitting the dataset? I am skeptical that the predictors are as good as my best model suggests.
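To make my worry concrete: even with pure-noise "models", picking the best of many inflates the winning score (a toy simulation with made-up sizes, not my actual data):

```r
set.seed(1)
n <- 200
actual <- rep(c(0, 1), each = n / 2)

# 416 "models" that are pure coin flips -- no real signal at all
f1s <- replicate(416, {
  pred <- sample(c(0, 1), n, replace = TRUE)
  tp <- sum(pred == 1 & actual == 1)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  2 * tp / (2 * tp + fp + fn)   # F1 score of this random "model"
})

mean(f1s)  # close to 0.5, as expected for random guessing
max(f1s)   # noticeably higher: the selected winner looks better than it is
```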
Thank you and kind regards,
Jan
Best Answer
In tuneRF, set the ntreeTry parameter as high as your time allows, or leave it at its default; otherwise you won't get statistically sound results.
As for your treerange parameter, I'd advise values well over 100, as high as your machine's performance allows.
With so few features, I wouldn't bother with feature selection at all, unless you have performance limits. In that case, try Boruta.