Solved – Is it legitimate to modify the classification of a scikit-learn random forest classifier by changing its default threshold

scikit-learn

I am using a random forest binary classifier (in sklearn) in Python to detect anomalous events on an extremely unbalanced dataset (1% of events are positive and 99% are negative). My recall score for the positive class is generally above 4%, which is not very good, but at least better than a random classifier, if I have understood this thread correctly: Good F1 score for anomaly detection.

Using sklearn's random forest classifier, I understand that the binary classifier labels an event according to the more probable class, as given by the clf.predict_proba() output. Given the class imbalance, though, is it legitimate to override this decision rule and instead classify an event as positive whenever the predicted probability of the positive class exceeds some threshold (say, 0.3)? If so, how do I optimize this threshold? Perhaps by testing different thresholds and measuring their impact on the recall or F1 score, as in the sketch below?
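A minimal sketch of what that would look like, using synthetic 99%/1% data in place of the real anomaly-detection set (the split sizes, forest settings, and threshold grid here are placeholders, not anything from the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the unbalanced anomaly data: ~1% positives.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01],
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba_pos = clf.predict_proba(X_val)[:, 1]  # P(positive class) per event

# clf.predict() is equivalent to thresholding these probabilities at 0.5;
# any other cut-off has to be applied manually, e.g. the 0.3 from above:
y_pred_03 = (proba_pos > 0.3).astype(int)

# Sweep candidate thresholds on a held-out set and record recall / F1.
for t in np.arange(0.1, 0.6, 0.1):
    y_pred = (proba_pos > t).astype(int)
    print(f"t={t:.1f}  recall={recall_score(y_val, y_pred):.3f}  "
          f"F1={f1_score(y_val, y_pred):.3f}")
```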

Maybe this procedure is completely out of the question. If so, what are the alternatives for improving recall and F1 scores on unbalanced datasets: some sort of re-sampling technique, or class weighting (I am unsure how to do the latter with a random forest)?
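For reference, sklearn's RandomForestClassifier does expose a class_weight parameter, so class weighting needs no manual re-sampling; a sketch (the explicit weights shown are illustrative, not recommended values):

```python
from sklearn.ensemble import RandomForestClassifier

# 'balanced' derives weights inversely proportional to class frequencies
# in the training set; 'balanced_subsample' recomputes them on each
# tree's bootstrap sample.
clf = RandomForestClassifier(class_weight="balanced_subsample",
                             random_state=0)

# An explicit mapping is also accepted, e.g. weighting the rare positive
# class 99 times more heavily than the negative class:
clf = RandomForestClassifier(class_weight={0: 1, 1: 99}, random_state=0)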

Best Answer

The methodological error here is the use of a threshold: tuning one against recall or F1 amounts to comparing classifiers with an improper scoring rule. Instead, you should compare classifiers on the basis of a score that emphasizes the qualities you want your models to have: a proper scoring rule such as the Brier score or cross-entropy, the $c$-statistic, or the costs of misclassifications.
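All three of these are available in sklearn.metrics and operate on predicted probabilities rather than thresholded labels. A sketch of a comparison, assuming model_a and model_b are hypothetical fitted classifiers and X_val / y_val a held-out set:

```python
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

for name, model in [("A", model_a), ("B", model_b)]:
    p = model.predict_proba(X_val)[:, 1]  # predicted P(positive class)
    print(f"model {name}: "
          f"Brier={brier_score_loss(y_val, p):.4f}  "    # proper scoring rule
          f"log-loss={log_loss(y_val, p):.4f}  "         # cross-entropy
          f"c-statistic={roc_auc_score(y_val, p):.4f}")  # ROC AUC
```

No threshold appears anywhere: the better model is the one with the better probability estimates, and any eventual cut-off can be chosen afterwards from the actual misclassification costs.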