Solved – How to change threshold for classification in R randomForests

classification, r, random forest, threshold

The species distribution modelling literature suggests that when predicting the presence/absence of a species with a model that outputs probabilities (e.g., random forests), the choice of the probability threshold used to classify a site as presence or absence is important, and one should not always rely on the default of 0.5. I need some help with this! Here is my code:

library(randomForest)
library(PresenceAbsence)

#build model
RFfit <- randomForest(Y ~ x1 + x2 + x3 + x4 + x5, data=mydata, mtry = 2, ntree=500)

#eventually I will apply this to (predict for) new data, but first I predict back to the training data to compare observed vs. predicted
RFpred <- predict(RFfit, mydata, type = "prob")

#put the observed vs. predicted in the same dataframe
ObsPred <- data.frame(cbind(mydata), Predicted=RFpred)

#create auc.roc plot
auc.roc.plot(ObsPred, threshold = 10, xlab="1-Specificity (false positives)",
  ylab="Sensitivity (true positives)", main="ROC plot", color=TRUE,
  find.auc=TRUE, opt.thresholds=TRUE, opt.methods=9) 

From this I determined that the threshold I would like to use for classifying presence from the predicted probabilities is 0.7, not the default of 0.5.
I don't totally understand what to do with this information.
Do I simply use this threshold when creating a map of my output? I could easily create a mapped output with continuous probabilities then simply reclassify those with values greater than 0.7 as present, and those < 0.7 as absent.

Or do I want to take this information and re-run my randomForest model using the cutoff parameter? What exactly is the cutoff parameter doing? Does it change the resultant vote? (It currently says it is "majority".) How do I use this cutoff parameter? I don't understand the documentation! Thanks!

Best Answer

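#set threshold or cutoff value to 0.7

cutoff <- 0.7

#predict(..., type = "prob") returns a matrix with one column per class;
#take the presence column (assumed here to be the second column)

presProb <- RFpred[, 2]

#probabilities lower than the cutoff value 0.7 are classified as 0 (absent);
#probabilities greater than or equal to 0.7 are classified as 1 (present)

RFclass <- ifelse(presProb >= cutoff, 1, 0)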
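
On the cutoff parameter itself: as far as I read the randomForest documentation, `cutoff` is a vector with one value per class, and the winning class is the one with the largest ratio of vote fraction to its cutoff. The default of 1/k per class is the plain majority vote. The sketch below reproduces that rule in base R only, so it does not call the package. The vote matrix and the helper `classify_with_cutoff` are made up for illustration, and the class labels "0" (absent) and "1" (present) are an assumption about how the response is coded.

```r
# Hypothetical vote fractions, one row per site.
# Columns: class "0" = absent, class "1" = present (assumed coding).
votes <- matrix(c(0.45, 0.55,
                  0.35, 0.65,
                  0.25, 0.75,
                  0.10, 0.90),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("0", "1")))

# Illustrative helper (not part of randomForest): pick the class with the
# largest ratio of vote fraction to its cutoff.
classify_with_cutoff <- function(votes, cutoff) {
  ratio <- sweep(votes, 2, cutoff, "/")   # divide each column by its cutoff
  colnames(votes)[max.col(ratio, ties.method = "first")]
}

# cutoff = c(0.5, 0.5) is the default for two classes: majority vote
default_class <- classify_with_cutoff(votes, c(0.5, 0.5))

# cutoff = c(0.3, 0.7): presence only wins when its vote fraction exceeds 0.7
strict_class <- classify_with_cutoff(votes, c(0.3, 0.7))

# Thresholding the presence probability directly gives the same labels
threshold_class <- ifelse(votes[, "1"] > 0.7, "1", "0")

print(default_class)
print(strict_class)
print(threshold_class)
```

So for a two-class problem, passing something like `cutoff = c(0.3, 0.7)` (in the order of the factor levels) should give the same labels as thresholding the presence probability at 0.7 after the fact. If that holds for your data, re-running the model is not necessary just to apply the threshold: reclassifying the continuous probability map at 0.7 is the simpler route.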