Solved – classification threshold in RandomForest (sklearn)

classification, precision-recall, random forest, unbalanced-classes

1) How can I change the classification threshold (I think it is 0.5 by default) in RandomForest in sklearn?

2) How can I under-sample in sklearn?

3) I have the following result from RandomForest classifier:
[[1635 1297]
 [ 520 3624]]

             precision    recall  f1-score   support

    class 0       0.76      0.56      0.64      2932
    class 1       0.74      0.87      0.80      4144

avg / total       0.75      0.74      0.73      7076

First, the data is unbalanced (30% from class-0 and 70% from class-1), so I think the classifier is likely biased toward class-1, moving samples from class-0 into class-1 (there are 1297 misclassifications for class-0 but only 520 for class-1). How can I fix this? Would downsampling help, or changing the classification threshold?

Update: class-0 has 40% of the population while class-1 has 60%. However, the drift from class-0 to class-1 (1297 misclassified samples) is high, and I want it to be low.

Best Answer

You could indeed wrap your random forest in a class whose predict method calls the predict_proba method of the internal random forest and outputs class 1 only if the predicted probability is higher than a custom threshold.
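A minimal sketch of such a wrapper (the `ThresholdForest` class name and its constructor arguments are my own invention, not a sklearn API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


class ThresholdForest:
    """Random forest that predicts class 1 only above a custom probability threshold."""

    def __init__(self, threshold=0.5, **rf_kwargs):
        self.threshold = threshold
        self.rf = RandomForestClassifier(**rf_kwargs)

    def fit(self, X, y):
        self.rf.fit(X, y)
        return self

    def predict(self, X):
        # Column 1 of predict_proba is P(class 1); compare it to the threshold
        proba = self.rf.predict_proba(X)[:, 1]
        return (proba >= self.threshold).astype(int)


# Synthetic unbalanced data roughly matching the question (30% / 70%)
X, y = make_classification(n_samples=500, weights=[0.3, 0.7], random_state=0)

clf = ThresholdForest(threshold=0.7, n_estimators=100, random_state=0).fit(X, y)
pred = clf.predict(X)

# Raising the threshold can only reduce how often class 1 is predicted
default_pred = (clf.rf.predict_proba(X)[:, 1] >= 0.5).astype(int)
assert pred.sum() <= default_pred.sum()
```

Raising the threshold above 0.5 makes the classifier more conservative about class 1, which should reduce the class-0-to-class-1 drift the question describes, at the cost of recall on class 1.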

Alternatively, you can bias the training algorithm by passing a higher sample_weight for samples from the minority class.