Solved – Why does the naive Bayes classifier only give me probabilities near 0

machine-learning, naive-bayes

I'm building a text classifier using the naive Bayes formula. I'm still early in development, but I can already see a problem with my technique, and I was wondering if you have any ideas that would help me solve it.

What I want to do is score texts so I can order them from most likely to be in class A to least likely. I only have one class, and I want to find the likelihood that a text belongs to it.

The problem is that I only get predictions really close to zero (1.068E-12, for example). The reason is that most words have a probability below 0.5 of being in class A. Even when I do have words with probabilities > 0.5, those probabilities are farther from 1 than the probabilities < 0.5 are from 0.

So when I choose the N words whose probabilities are farthest from 0.5, I usually get only (or at least mostly) probabilities < 0.5. And so the more words I use (N), the closer the product gets to 0.
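To illustrate numerically (the per-word probabilities here are made up), multiplying even a handful of probabilities below 0.5 collapses the score toward zero:

```python
# Made-up per-word probabilities, all below 0.5: the product
# shrinks toward zero very quickly as more words are included.
word_probs = [0.3, 0.4, 0.2, 0.45, 0.35, 0.25, 0.4, 0.3, 0.2, 0.35]

score = 1.0
for p in word_probs:
    score *= p

print(score)  # ~7.9e-06 with just ten words, and it keeps shrinking as N grows
```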

Is there some optimization I could implement that would help with this problem? (For now I don't even remove stop words, but I plan to.)

Or is a Bayes classifier a bad choice for my problem?

Best Answer

Naive Bayes generally uses a decision rule like $$ \text{argmax}_{C_i} P(C_i)P(D|C_i), $$ which comes from the fact that we can write $$ P(C_i|D) = \frac{P(C_i)P(D|C_i)}{P(D)} $$ and drop the denominator $P(D)$, since it does not depend on the class. However, since $P(D) \ll 1$ (i.e. there are many possible documents), neglecting it will cause the output of your algorithm to be quite small, so this isn't necessarily an indication that your implementation is incorrect.
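If you want an actual probability in $[0, 1]$ rather than an unnormalized score, you can reintroduce the denominator by normalizing over the classes (here, class A and its complement). A minimal sketch, where the prior and likelihood values are hypothetical placeholders:

```python
# Hypothetical unnormalized scores P(C_i) * P(D | C_i) for two classes:
# class A and its complement "not A". The absolute values are tiny,
# but their ratio is what carries the information.
score_a     = 0.4 * 1.068e-12   # P(A)     * P(D | A)
score_not_a = 0.6 * 2.5e-13     # P(not A) * P(D | not A)

# Normalizing reintroduces P(D) = sum of scores over all classes,
# turning the tiny scores into a proper posterior probability.
posterior_a = score_a / (score_a + score_not_a)
print(posterior_a)  # ~0.74: D is actually more likely to be in class A
```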

A practical tip: one thing you can and should do is work with sums of log probabilities rather than products of probabilities, to avoid underflow errors. Rather than computing $$ P(D|C_i) = \prod_{w_j \in D} P(w_j|C_i), $$ compute $$ \log P(D|C_i) = \sum_{w_j \in D} \log P(w_j|C_i). $$ You'll also need to handle unseen words, since zero probabilities will give you problems; Laplace (add-one) smoothing is the standard fix.
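For concreteness, here is a minimal sketch of log-space scoring with Laplace smoothing. The word counts, total count, and vocabulary size are hypothetical placeholders for whatever your training data produces:

```python
import math

# Hypothetical training statistics for class A: how often each word
# appeared in class-A documents, plus total count and vocabulary size.
word_counts_a = {"free": 30, "offer": 12, "meeting": 3}
total_words_a = 1000
vocab_size = 5000

def log_word_prob(word):
    """Laplace (add-one) smoothed log P(word | A); unseen words get a
    small but nonzero probability instead of zeroing out the score."""
    count = word_counts_a.get(word, 0)
    return math.log((count + 1) / (total_words_a + vocab_size))

def log_likelihood(document):
    """log P(D | A) as a sum of log word probabilities, which avoids
    the underflow you hit when multiplying many small numbers."""
    return sum(log_word_prob(w) for w in document)

doc = ["free", "offer", "unseen_word"]
print(log_likelihood(doc))  # a large negative number, safe to compare
```

Because log is monotonic, ranking texts by this sum gives exactly the same ordering as ranking by the raw product, which is all you need for your scoring use case.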