Solved – High Recall – Low Precision for unbalanced dataset

classification, machine-learning, precision-recall, svm, unbalanced-classes

I’m currently running into problems analyzing a tweet dataset with support vector machines. I have an unbalanced binary training set (5:2), which is expected to be proportional to the real class distribution. When predicting, I get a low precision (0.47) for the minority class on the validation set; recall is 0.88. I tried several oversampling and undersampling methods (performed on the training set), which did not improve the precision, since the validation set is unbalanced as well to reflect the real class distribution. I also implemented different misclassification costs in the support vector machine, which helped. Now it seems that I cannot improve my performance any further.
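For reference, a minimal sketch of the kind of cost-sensitive setup I mean, assuming scikit-learn’s SVC (the data below is a synthetic stand-in for my tweet features, generated with roughly the same 5:2 class ratio):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Synthetic stand-in for the tweet features, with a ~5:2 negative:positive ratio.
X, y = make_classification(n_samples=7000, n_features=50,
                           weights=[5 / 7, 2 / 7], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# class_weight="balanced" raises the misclassification cost of the minority
# class in inverse proportion to its frequency (the "different costs" idea).
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X_train, y_train)

print(classification_report(y_val, clf.predict(X_val)))
```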

Does anyone have advice on what I could do to improve my precision without hurting my recall? Furthermore, does anyone have a clue why I’m getting far more false positives than false negatives (positive is the minority class)?

Best Answer

does anyone have a clue why I’m getting far more false positives than false negatives (positive is the minority class)?

Because positive is the minority class. There are a lot of negative examples that could become false positives. Conversely, there are fewer positive examples that could become false negatives.

Recall that Recall = Sensitivity $=\dfrac{TP}{(TP+FN)}$

Sensitivity (the true positive rate, TPR) is related to the false positive rate (FPR, i.e. 1 − specificity), as visualized by an ROC curve. At one extreme, you call every example positive and get 100% sensitivity with a 100% FPR. At the other, you call no example positive and get 0% sensitivity with a 0% FPR. When the positive class is the minority, even a relatively small FPR (which you may well have, since you have a high recall = sensitivity = TPR) will produce a large number of FPs, because there are so many negative examples.
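To make that concrete, here is a quick back-of-the-envelope calculation (the counts and FPR values are invented for illustration; the last FPR is simply the one that reproduces the precision reported in the question):

```python
# Hypothetical validation set with the question's 5:2 negative:positive ratio.
n_neg, n_pos = 5000, 2000
tpr = 0.88  # the recall reported in the question

tp = tpr * n_pos        # 1760 true positives
fn = (1 - tpr) * n_pos  # only 240 false negatives, fixed by the recall

for fpr in (0.10, 0.20, 0.40):
    fp = fpr * n_neg
    print(f"FPR={fpr:.2f}: FP={fp:.0f} vs FN={fn:.0f}, "
          f"precision={tp / (tp + fp):.2f}")

# FPR=0.10: FP=500 vs FN=240, precision=0.78
# FPR=0.20: FP=1000 vs FN=240, precision=0.64
# FPR=0.40: FP=2000 vs FN=240, precision=0.47  <- the question's numbers
```

Even at a 10% FPR, the false positives already outnumber the false negatives roughly two to one.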

Since

Precision $=\dfrac{TP}{(TP+FP)}$

even at a relatively low FPR, the FPs will overwhelm the TPs if the number of negative examples is much larger.

Alternatively, by Bayes’ theorem, with

Positive classifier: $C^+$

Positive example: $O^+$

Precision = $P(O^+|C^+)=\dfrac{P(C^+|O^+)P(O^+)}{P(C^+)}$

$P(O^+)$ is low when the positive class is small, which pulls the precision down for any classifier with a nonzero FPR.
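As a worked instance of that identity (the rates here are illustrative, not taken from the question), fix the TPR at $P(C^+|O^+)=0.88$ and the FPR at $0.10$, and expand $P(C^+)=TPR\cdot P(O^+)+FPR\,(1-P(O^+))$:

$$\text{Precision}=\dfrac{0.88\times 0.5}{0.88\times 0.5+0.10\times 0.5}\approx 0.90\quad\text{when }P(O^+)=0.5,$$

$$\text{Precision}=\dfrac{0.88\times\frac{2}{7}}{0.88\times\frac{2}{7}+0.10\times\frac{5}{7}}\approx 0.78\quad\text{when }P(O^+)=\tfrac{2}{7}.$$

The classifier’s error rates are identical in both cases; the drop in precision comes entirely from the lower prevalence $P(O^+)$.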

Does anyone have advice on what I could do to improve my precision without hurting my recall?

As mentioned by @rinspy, gradient boosting classifiers (GBC) work well in my experience. A GBC will, however, be slower than SVC with a linear kernel, but you can use very shallow trees to speed it up. Also, more features or more observations might help (for example, there might be some currently unanalyzed feature that is always set to some value in all of your current FPs).
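A minimal sketch of that suggestion, assuming scikit-learn’s GradientBoostingClassifier on synthetic stand-in data (the hyperparameters are illustrative starting points, not tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with the question's rough 5:2 class ratio.
X, y = make_classification(n_samples=7000, n_features=50,
                           weights=[5 / 7, 2 / 7], random_state=0)

# Very shallow trees (max_depth=2) keep training reasonably fast.
gbc = GradientBoostingClassifier(n_estimators=200, max_depth=2,
                                 learning_rate=0.1, random_state=0)

# Average precision summarizes the precision-recall trade-off.
print(cross_val_score(gbc, X, y, scoring="average_precision", cv=5))
```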

It might also be worth plotting ROC curves and calibration curves. Even if the classifier has low precision, it could still yield a very useful probability estimate. For example, just knowing that a hard drive has a 500-fold increased probability of failing, even though the absolute probability is fairly small, might be important information.
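A sketch of both diagnostics, again assuming scikit-learn and matplotlib on synthetic stand-in data:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=7000, n_features=50,
                           weights=[5 / 7, 2 / 7], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y,
                                                  random_state=0)

proba = (GradientBoostingClassifier(random_state=0)
         .fit(X_train, y_train)
         .predict_proba(X_val)[:, 1])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# ROC curve: TPR against FPR across all decision thresholds.
fpr, tpr, _ = roc_curve(y_val, proba)
ax1.plot(fpr, tpr)
ax1.set(xlabel="FPR", ylabel="TPR", title="ROC curve")

# Calibration curve: do predicted probabilities match observed frequencies?
frac_pos, mean_pred = calibration_curve(y_val, proba, n_bins=10)
ax2.plot(mean_pred, frac_pos, marker="o")
ax2.plot([0, 1], [0, 1], linestyle="--")  # perfect calibration reference
ax2.set(xlabel="Mean predicted probability", ylabel="Fraction of positives",
        title="Calibration curve")

plt.tight_layout()
plt.show()
```

If the calibration curve tracks the diagonal, the predicted probabilities can be used directly for relative-risk statements like the hard-drive example above, even when the operating-point precision is low.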

Also, low precision essentially means that the classifier returns a lot of false positives. That might not be so bad, however, if a false positive is cheap.