I'm trying to solve a pedestrian detection task by training a binary classifier on two categories: positives (people) and negatives (background).
I have a dataset:
- number of positives = 3752
- number of negatives = 3800
I use an 80/20 train/test split and scikit-learn's RandomForestClassifier
with the parameters:
RandomForestClassifier(n_estimators=100, max_depth=50, n_jobs=-1)
I get a score of 95.896757%.
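For reference, a minimal sketch of this setup might look like the following; `X` and `y` are hypothetical names for the already-extracted feature matrix and 0/1 labels (my feature-extraction step is not shown):

```python
# Minimal sketch of the setup above; X and y are hypothetical names for the
# already-extracted feature matrix and 0/1 labels (1 = people).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestClassifier(n_estimators=100, max_depth=50, n_jobs=-1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split
```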
Test on the training data (works perfectly):
true positive: 3005
false positive: 0
false negative: 0
true negative: 3036
Test on the test data:
true positive: 742
false positive: 57
false negative: 5
true negative: 707
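The counts above can be read off scikit-learn's confusion matrix; a sketch, assuming `clf`, `X_test`, and `y_test` from the snippet above:

```python
# scikit-learn lays the binary confusion matrix out as [[TN, FP], [FN, TP]].
from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"true positive: {tp}")
print(f"false positive: {fp}")
print(f"false negative: {fn}")
print(f"true negative: {tn}")
```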
My question is: how can I reduce the number of false positives (background classified as people)? And why do I get more false positive errors than false negatives?
I tried the class_weight parameter, but at some point performance degrades (as you can see at class_weight = {0:1, 1:4}); the full sweep is summarized in the table below.
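For reference, the weights are passed straight to the constructor; a minimal sketch (this corresponds to the {0:1, 1:2} row in the table below):

```python
# Weighting the positive class (1 = people) twice as heavily as background.
clf = RandomForestClassifier(n_estimators=100, max_depth=50, n_jobs=-1,
                             class_weight={0: 1, 1: 2})
clf.fit(X_train, y_train)
```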
| class_weight | train TP / FP / FN / TN | test TP / FP / FN / TN | score |
|---|---|---|---|
| {0:1, 1:1} | 3005 / 0 / 0 / 3036 | 742 / 55 / 5 / 709 | 96.029120% |
| {0:1, 1:2} | 3005 / 0 / 0 / 3036 | 741 / 45 / 6 / 719 | 96.624752% |
| {0:1, 1:3} | 3005 / 0 / 0 / 3036 | 738 / 44 / 9 / 720 | 96.492389% |
| {0:1, 1:4} | 3005 / 13 / 0 / 3023 | 735 / 46 / 12 / 718 | 96.161482% |
| {0:1, 1:5} | 3005 / 31 / 0 / 3005 | 737 / 48 / 10 / 716 | 96.161482% |
| {0:1, 1:6} | 3005 / 56 / 0 / 2980 | 736 / 51 / 11 / 713 | 95.896757% |
| {0:1, 1:7} | 3005 / 87 / 0 / 2949 | 734 / 59 / 13 / 705 | 95.234944% |
It is also worth noting that the random forest does not seem to suffer from an unbalanced dataset:
- number of positives = 3752
- number of negatives = 10100
| class_weight | train TP / FP / FN / TN | test TP / FP / FN / TN | score |
|---|---|---|---|
| {0:1, 1:1} | 3007 / 0 / 0 / 8074 | 729 / 71 / 16 / 1955 | 96.860339% |
| {0:1, 1:2} | 3007 / 0 / 0 / 8074 | 728 / 59 / 17 / 1967 | 97.257308% |
| {0:1, 1:3} | 3007 / 0 / 0 / 8074 | 727 / 58 / 18 / 1968 | 97.257308% |
Best Answer
I am not an expert when it comes to random forests; I only read about them quite recently. But from how it looks to me, you are overfitting the random forest. What I would do is use the out-of-bag (OOB) observations to make predictions and estimate the generalization error. You can find the procedure in these slides: https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/trees.pdf
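In scikit-learn this does not require implementing OOB prediction by hand; a minimal sketch, reusing the asker's hypothetical X_train/y_train from above:

```python
# oob_score=True makes the forest score each training sample using only the
# trees that did NOT see that sample in their bootstrap draw.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=50, n_jobs=-1,
                             oob_score=True)
clf.fit(X_train, y_train)
print(clf.oob_score_)              # accuracy estimated on OOB samples
print(clf.oob_decision_function_)  # per-sample OOB class probabilities
```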
Another thing I would suggest, also covered in those slides, is the gradient boosting machine (GBM). I find GBM more intuitive than random forests.
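A minimal sketch of the GBM route via scikit-learn (the hyperparameters here are illustrative defaults, not tuned values):

```python
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))
```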
Edit 1: I checked again, and it seems bootstrapping is the very first step of GBM. Also, I have no problem with bootstrapping per se; it is nice and good. The only problem is that it can be used very badly.
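For concreteness, bootstrapping here just means resampling the training set with replacement; a tiny NumPy sketch (hypothetical X_train/y_train as arrays):

```python
import numpy as np

rng = np.random.default_rng(0)
n = len(X_train)
idx = rng.integers(0, n, size=n)             # draw n indices with replacement
X_boot, y_boot = X_train[idx], y_train[idx]  # one bootstrap replicate
```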