Solved – Is it legitimate to modify the classification of a scikit-learn random forest classifier by changing its default threshold

scikit-learn

I am using a random forest binary classifier (in sklearn) in Python to detect anomalous events on an extremely unbalanced dataset (1% of events are positive and 99% are negative). My recall score for the positive class is generally above 4%, which is not very good, but at least better than a random classifier, if I have understood this thread correctly: Good F1 score for anomaly detection.

Using sklearn's random forest classifier, I understand that the binary classifier labels an event according to the more probable class, as given by the clf.predict_proba() output. Given the class imbalance, though, is it legitimate to override this decision rule and instead classify an event as positive whenever the predicted probability of the positive class exceeds some threshold (say, 0.3)? If so, how do I optimize this threshold? Perhaps by testing different thresholds and measuring their impact on the recall or F1 score, as in the sketch below?
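A minimal sketch of what that would look like, using synthetic 99%/1% data in place of the real anomaly-detection set (the split sizes, forest settings, and threshold grid here are placeholders, not anything from the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the unbalanced anomaly data: ~1% positives.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01],
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba_pos = clf.predict_proba(X_val)[:, 1]  # P(positive class) per event

# clf.predict() is equivalent to thresholding these probabilities at 0.5;
# any other cut-off has to be applied manually, e.g. the 0.3 from above:
y_pred_03 = (proba_pos > 0.3).astype(int)

# Sweep candidate thresholds on a held-out set and record recall / F1.
for t in np.arange(0.1, 0.6, 0.1):
    y_pred = (proba_pos > t).astype(int)
    print(f"t={t:.1f}  recall={recall_score(y_val, y_pred):.3f}  "
          f"F1={f1_score(y_val, y_pred):.3f}")
```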

Maybe this procedure is completely out of the question. If so, what are the alternatives for improving recall and F1 scores on unbalanced datasets: some sort of re-sampling technique, or class weighting (I am unsure how to do the latter with a random forest)?
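For reference, sklearn's RandomForestClassifier does expose a class_weight parameter, so class weighting needs no manual re-sampling; a sketch (the explicit weights shown are illustrative, not recommended values):

```python
from sklearn.ensemble import RandomForestClassifier

# 'balanced' derives weights inversely proportional to class frequencies
# in the training set; 'balanced_subsample' recomputes them on each
# tree's bootstrap sample.
clf = RandomForestClassifier(class_weight="balanced_subsample",
                             random_state=0)

# An explicit mapping is also accepted, e.g. weighting the rare positive
# class 99 times more heavily than the negative class:
clf = RandomForestClassifier(class_weight={0: 1, 1: 99}, random_state=0)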

Best Answer

The methodological error here is the use of a threshold: tuning one against recall or F1 amounts to comparing classifiers with an improper scoring rule. Instead, you should compare classifiers on the basis of a score that emphasizes the qualities you want your models to have: a proper scoring rule such as the Brier score or cross-entropy, the $c$-statistic, or the costs of misclassifications.
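All three of these are available in sklearn.metrics and operate on predicted probabilities rather than thresholded labels. A sketch of a comparison, assuming model_a and model_b are hypothetical fitted classifiers and X_val / y_val a held-out set:

```python
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

for name, model in [("A", model_a), ("B", model_b)]:
    p = model.predict_proba(X_val)[:, 1]  # predicted P(positive class)
    print(f"model {name}: "
          f"Brier={brier_score_loss(y_val, p):.4f}  "    # proper scoring rule
          f"log-loss={log_loss(y_val, p):.4f}  "         # cross-entropy
          f"c-statistic={roc_auc_score(y_val, p):.4f}")  # ROC AUC
```

No threshold appears anywhere: the better model is the one with the better probability estimates, and any eventual cut-off can be chosen afterwards from the actual misclassification costs.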