Solved – How to reduce number of false positives

classification, computer vision, precision-recall, random forest, unbalanced-classes

I'm trying to solve a pedestrian-detection task: I train a binary classifier on two categories, positives – people, negatives – background.

I have dataset:

  • number of positives = 3752
  • number of negatives = 3800

I use an 80/20 train/test split and RandomForestClassifier from scikit-learn
with the parameters:

RandomForestClassifier(n_estimators=100, max_depth=50, n_jobs=-1)

I get a score of 95.896757 %.
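Roughly, the setup looks like this (a sketch; X and y are placeholders for the extracted feature vectors and labels, which are built elsewhere):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# 80/20 split, then fit the forest with the parameters above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
clf = RandomForestClassifier(n_estimators=100, max_depth=50, n_jobs=-1)
clf.fit(X_train, y_train)
print("score: %f %%" % (100 * clf.score(X_test, y_test)))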

Test on training data (works perfectly):

true positive:  3005
false positive:  0
false negative:  0
true negative:  3036

Test on test data:

true positive:  742
false positive:  57
false negative:  5
true negative:  707

My question is: how can I reduce the number of false positives (background classified as people)? And why do I get more false-positive errors than false-negative errors?

I tried the class_weight parameter, but at some point performance degrades (as you can see from class_weight= {0:1,1:4} onwards). For each setting below, the first confusion matrix is on the training data and the second is on the test data.
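A sketch of how such a sweep can be scripted (using the split from above; confusion_matrix is from scikit-learn and returns counts for labels 0 and 1):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

for w in range(1, 8):
    clf = RandomForestClassifier(n_estimators=100, max_depth=50, n_jobs=-1,
                                 class_weight={0: 1, 1: w})
    clf.fit(X_train, y_train)
    # ravel() of the 2x2 confusion matrix gives tn, fp, fn, tp (positive = class 1)
    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
    print("class_weight= {0:1,1:%d}" % w)
    print("true positive: ", tp, " false positive: ", fp)
    print("false negative: ", fn, " true negative: ", tn)
    print("score: %f %%" % (100 * clf.score(X_test, y_test)))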

class_weight= {0:1,1:1}
true positive:  3005
false positive:  0
false negative:  0
true negative:  3036

true positive:  742
false positive:  55
false negative:  5
true negative:  709
score: 96.029120 %

class_weight= {0:1,1:2}
true positive:  3005
false positive:  0
false negative:  0
true negative:  3036

true positive:  741
false positive:  45
false negative:  6
true negative:  719
score: 96.624752 %

class_weight= {0:1,1:3}
true positive:  3005
false positive:  0
false negative:  0
true negative:  3036

true positive:  738
false positive:  44
false negative:  9
true negative:  720
score: 96.492389 %

class_weight= {0:1,1:4}
true positive:  3005
false positive:  13
false negative:  0
true negative:  3023

true positive:  735
false positive:  46
false negative:  12
true negative:  718
score: 96.161482 %

class_weight= {0:1,1:5}
true positive:  3005
false positive:  31
false negative:  0
true negative:  3005

true positive:  737
false positive:  48
false negative:  10
true negative:  716
score: 96.161482 %

class_weight= {0:1,1:6}
true positive:  3005
false positive:  56
false negative:  0
true negative:  2980

true positive:  736
false positive:  51
false negative:  11
true negative:  713
score: 95.896757 %

class_weight= {0:1,1:7}
true positive:  3005
false positive:  87
false negative:  0
true negative:  2949

true positive:  734
false positive:  59
false negative:  13
true negative:  705
score: 95.234944 %

It is also worth noting that RandomForest does not seem to suffer from an unbalanced dataset:

pos= 3752
neg= 10100

class_weight= {0:1,1:1}
true positive: 3007
false positive: 0
false negative: 0
true negative: 8074

true positive:  729
false positive:  71
false negative:  16
true negative:  1955
score: 96.860339 %

class_weight= {0:1,1:2}
true positive:  3007
false positive:  0
false negative:  0
true negative:  8074

true positive:  728
false positive:  59
false negative:  17
true negative:  1967
score: 97.257308 %

class_weight= {0:1,1:3}
true positive:  3007
false positive:  0
false negative:  0
true negative:  8074

true positive:  727
false positive:  58
false negative:  18
true negative:  1968
score: 97.257308 %

Best Answer

I am not an expert when it comes to random forests; I read about them only quite recently. But from how it looks to me, you are overfitting the random forest. What I would do is use the out-of-bag (OOB) observations to make predictions. You can find the procedure in these slides: https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/trees.pdf
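A minimal sketch of getting an OOB estimate with scikit-learn (assuming the same training data as in the question; oob_score=True makes the forest score itself on the out-of-bag samples):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=50, n_jobs=-1,
                             oob_score=True)  # bootstrap=True is the default
clf.fit(X_train, y_train)
# accuracy estimated on samples each tree did not see during training
print("OOB accuracy:", clf.oob_score_)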

Another thing I would suggest, also mentioned in those slides, is the gradient boosting machine (GBM). I feel that GBM is more intuitive than a random forest.
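For example, a sketch with scikit-learn's GradientBoostingClassifier (the parameters here are illustrative defaults, not tuned values):

from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print("test accuracy:", gbm.score(X_test, y_test))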

Edit 1: I checked it again, and it seems bootstrapping is the very first step of GBM. Also, I do not have a problem with bootstrapping per se; it is a sound technique. The only problem is that it can be used very badly.