How to control the cost of misclassification in Random Forests

classification, loss-functions, metric, r, random-forest

Is it possible to control the cost of misclassification in the R package randomForest?

In my own work, false negatives (e.g., failing to flag that a person may have a disease) are far more costly than false positives. The rpart package lets the user control misclassification costs by specifying a loss matrix that weights different kinds of misclassification differently. Does anything similar exist for randomForest? Should I, for instance, use the classwt option to control the Gini criterion?
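
For reference, here is a minimal sketch of the rpart loss-matrix approach I mean; the data frame my_data, the outcome disease, and the penalty of 5 are all hypothetical:

```r
library(rpart)

# Rows are the observed class, columns the predicted class, both in
# factor-level order, assumed here to be c("healthy", "disease");
# the diagonal must be zero. L[2, 1] = 5 makes missing a disease case
# five times as costly as a false alarm.
loss <- matrix(c(0, 1,
                 5, 0), nrow = 2, byrow = TRUE)

fit <- rpart(disease ~ ., data = my_data, method = "class",
             parms = list(loss = loss))
```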

Best Answer

Not directly, short of manually building a random forest clone that bags rpart models.

One option comes from the fact that the output of RF is actually a continuous score rather than a crisp decision, i.e., the fraction of trees that voted for a given class. This score can be extracted with predict(rf_model, type="prob") and used to build, for instance, a ROC curve, which will reveal a better threshold than 0.5; that threshold can later be incorporated into RF training via the cutoff parameter.
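
A minimal sketch of that workflow, assuming a binary outcome disease with factor levels c("healthy", "disease") in a hypothetical data frame my_data, and using the pROC package just to draw the ROC curve:

```r
library(randomForest)
library(pROC)

rf_model <- randomForest(disease ~ ., data = my_data)

# Out-of-bag vote fractions for the "disease" class.
scores <- predict(rf_model, type = "prob")[, "disease"]

# Inspect the ROC curve to choose a threshold that keeps sensitivity
# high (i.e., few costly false negatives).
plot(roc(my_data$disease, scores))

# Suppose inspection suggests predicting "disease" whenever its vote
# fraction exceeds 0.3 (an illustrative value). The cutoff argument
# takes one entry per class, in factor-level order.
thr <- 0.3
rf_tuned <- randomForest(disease ~ ., data = my_data,
                         cutoff = c(1 - thr, thr))
```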

The classwt approach also seems valid, but it does not work very well in practice -- the transition between balanced predictions and trivially predicting the same class regardless of the attributes tends to be too sharp to be usable.
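
For completeness, a sketch of the classwt experiment, using the same hypothetical data as above; the weight of 5 is illustrative:

```r
library(randomForest)

# Up-weight the "disease" class (weights in factor-level order:
# healthy, disease). In practice the useful range of weights is often
# narrow, and a slightly larger weight can flip the forest into always
# predicting the up-weighted class.
rf_weighted <- randomForest(disease ~ ., data = my_data,
                            classwt = c(1, 5))

# Compare per-class error rates against the unweighted forest.
print(rf_weighted$confusion)
```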
