1) How can I change the classification threshold (I think it is 0.5 by default) of a RandomForest in sklearn?
2) How can I under-sample in sklearn?
3) I have the following result from RandomForest classifier:
[[1635 1297]
 [ 520 3624]]
             precision    recall  f1-score   support
    class 0       0.76      0.56      0.64      2932
    class 1       0.74      0.87      0.80      4144
avg / total       0.75      0.74      0.73      7076
First, the data is unbalanced (30% from class 0 and 70% from class 1), so I think the classifier is biased toward class 1, meaning it moves some samples from class 0 to class 1 (there are 1297 misclassifications for class 0 but 520 for class 1). How can I fix this? Would downsampling help, or changing the classification threshold?
Update: class 0 is 40% of the population and class 1 is 60%. However, the drift from class 0 to class 1 (1297) is high, and I want it to be low.
Best Answer
You could indeed wrap your random forest in a class that has a predict method that calls the predict_proba method of the internal random forest and outputs class 1 only if its probability is higher than a custom threshold.
Alternatively, you can bias the training algorithm by passing a higher sample_weight for samples from the minority class.
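Both suggestions could look roughly like the sketch below. It is only an illustration under assumptions that are not from the question: the wrapper name ThresholdedForest, the threshold of 0.7, and the synthetic 40/60 data set are made up, and compute_sample_weight with "balanced" is just one way to give the minority class a higher weight.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight


class ThresholdedForest:
    """Hypothetical wrapper: predict class 1 only if P(class 1) > threshold."""

    def __init__(self, forest, threshold=0.5):
        self.forest = forest
        self.threshold = threshold

    def fit(self, X, y, sample_weight=None):
        self.forest.fit(X, y, sample_weight=sample_weight)
        return self

    def predict(self, X):
        # predict_proba columns follow the order of forest.classes_
        pos = list(self.forest.classes_).index(1)
        proba = self.forest.predict_proba(X)[:, pos]
        return (proba > self.threshold).astype(int)


# Synthetic stand-in for the real data: roughly 40% class 0, 60% class 1.
X, y = make_classification(n_samples=2000, weights=[0.4, 0.6], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Option 1: same forest, stricter threshold for predicting class 1.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
model = ThresholdedForest(forest, threshold=0.5).fit(X_train, y_train)
pred_default = model.predict(X_test)
model.threshold = 0.7  # assumed value; tune it on a validation set
pred_strict = model.predict(X_test)

# Option 2: give minority-class samples a higher weight during training.
weights = compute_sample_weight("balanced", y_train)
weighted_forest = RandomForestClassifier(n_estimators=100, random_state=0)
weighted_forest.fit(X_train, y_train, sample_weight=weights)
pred_weighted = weighted_forest.predict(X_test)
```

Raising the threshold can only reduce the number of samples assigned to class 1, so it trades class-1 recall for fewer class-0 samples drifting into class 1. The confusion matrix on held-out data shows whether that trade is worth it.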