Is it possible to control the cost of misclassification in the R package randomForest?
In my own work false negatives (e.g., missing in error that a person may have a disease) are far more costly than false positives. The package rpart allows the user to control misclassification costs by specifying a loss matrix to weight misclassifications differently. Does anything similar exist for randomForest? Should I, for instance, use the classwt option to control the Gini criterion?
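For reference, this is roughly how the loss-matrix approach looks in rpart (a minimal sketch; the data are simulated purely for illustration, while parms and its loss component are the actual rpart interface):

```r
library(rpart)

# simulated two-class data, for illustration only
set.seed(1)
n <- 200
d <- data.frame(a = rnorm(n), b = rnorm(n))
d$y <- factor(ifelse(d$a + rnorm(n) > 1, "disease", "healthy"))

# loss matrix: rows = true class, columns = predicted class,
# in the order of levels(d$y) ("disease", "healthy"), zero diagonal.
# Here a false negative (true "disease" predicted "healthy")
# costs 5x a false positive.
L <- matrix(c(0, 5,
              1, 0), nrow = 2, byrow = TRUE)

fit <- rpart(y ~ a + b, data = d, parms = list(loss = L))
```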
Best Answer
Not really, short of manually building an RF clone that bags rpart models. Some flexibility does come from the fact that the output of RF is actually a continuous score rather than a crisp decision, namely the fraction of trees that voted for a given class. This score can be extracted with
predict(rf_model, type = "prob")
and used to make, for instance, a ROC curve, which may reveal a better threshold than 0.5; that threshold can later be incorporated into RF training via the cutoff parameter. The classwt approach also seems valid, but it does not work very well in practice -- the transition between balanced prediction and trivially predicting the same class regardless of attributes tends to be too sharp to be usable.
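A minimal sketch of the score-then-threshold idea described above (the simulated data and the chosen threshold of 0.3 are my own illustration; type = "prob", cutoff, and classwt are actual randomForest arguments):

```r
library(randomForest)

# simulated two-class data, for illustration only
set.seed(42)
n <- 500
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(ifelse(x$a + rnorm(n) > 1, "disease", "healthy"))

rf <- randomForest(x, y)

# continuous scores: fraction of trees voting for each class
# (with no newdata, these are out-of-bag estimates)
prob <- predict(rf, type = "prob")

# after inspecting a ROC curve built from prob[, "disease"],
# suppose a vote fraction of 0.3 gives acceptable sensitivity;
# bake it in with cutoff (order follows levels(y))
rf2 <- randomForest(x, y, cutoff = c(0.3, 0.7))

# classwt is also accepted, though as noted above it often
# behaves poorly in practice
rf3 <- randomForest(x, y, classwt = c(5, 1))
```

Lowering the cutoff entry for the expensive class means fewer tree votes are needed to predict it, which trades false negatives for false positives without retraining the trees themselves.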