Solved – Naive Bayes classifier gives a probability greater than 1


I'm trying to understand an example of how to use a Naive Bayes classifier for spam filtering, based on this link. The article picks two "bad" words that it figures show up in spam a lot, and then calculates the probability that a message is spam based on the presence of those words. I tried to extend the example to three words, but I'm getting a probability greater than one and I can't see what I'm doing wrong.

Two word case:

$$P(spam|word\;a, word\;b) = \frac{P(spam) \times P(word\;a|spam) \times P(word\;b|spam)}{ P(word\;a) \times P(word\;b)}$$

What I'm doing is pretending that there's a super bad word "foobar" that is present in 100% of the spam messages. Does the Naive Bayes formula expand like so?

Three word case:

$$P(spam|word\;a, word\;b, word\;c) = \frac{P(spam) \times P(word\;a|spam) \times P(word\;b|spam) \times P(word\;c|spam)}{P(word\;a) \times P(word\;b) \times P(word\;c)}$$

What's tripping me up is that in the numerator $P(word\;c|spam) = 1$, while $P(word\;c)$ is some number between $0$ and $1$, so when I solve the entire equation, dividing by a denominator $< 1$ makes my answer grow quickly (and beyond $1$). An exaggerated version of this is $0.9 / 0.000001$.

Am I "breaking" the algorithm by having "foobar" show up 100% of the time in spam messages?

Best Answer

You will not break the algorithm by having a word which shows up in $100\%$ of spam messages. The formulas you are using for the probability are wrong. Here is an example for the two-word case to show why. Suppose your words are $a$, $b$, and $x$, and that you have two messages to use to build the classifier. The first message is spam and reads "a b". The second message is not spam and reads "x". Then

$$P(\text{spam} | \text{word a, word b}) = 1$$

as the only message which contains $a$ and $b$ is spam. But $P(\text{spam})=1/2$ as one out of two messages is spam. Also, $P(\text{word a}|\text{spam}) = 1$ and $P(\text{word b}|\text{spam}) = 1$. Also, $P(\text{word a})= 1/2$ because one out of two messages contains $a$, and similarly $P(\text{word b})=1/2$. So the right hand side of your formula is

$$P(\text{spam})\frac{P(\text{word a}|\text{spam})P(\text{word b}|\text{spam})} {P(\text{word a})P(\text{word b})} = 2,$$
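To make the arithmetic concrete, here is a minimal sketch of this toy example (the helper functions are just illustrative, not from any library):

```python
# Two toy messages: (text, is_spam).
messages = [("a b", True), ("x", False)]

def p(pred):
    """Fraction of all messages satisfying pred."""
    return sum(pred(t, s) for t, s in messages) / len(messages)

def p_given(pred, cond):
    """Fraction of messages satisfying pred among those satisfying cond."""
    subset = [(t, s) for t, s in messages if cond(t, s)]
    return sum(pred(t, s) for t, s in subset) / len(subset)

has = lambda word: (lambda t, s: word in t.split())
is_spam = lambda t, s: s

p_spam   = p(is_spam)                  # 1/2
p_a_spam = p_given(has("a"), is_spam)  # 1
p_b_spam = p_given(has("b"), is_spam)  # 1
p_a      = p(has("a"))                 # 1/2
p_b      = p(has("b"))                 # 1/2

wrong = p_spam * p_a_spam * p_b_spam / (p_a * p_b)
print(wrong)  # 2.0 -- not a probability
```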

which is not a probability. The correct formula is

$$P(\text{spam} | \text{word a, word b}) = \frac{P(\text{spam, word a, word b})}{P(\text{word a, word b})} = \frac{P(\text{word a, word b}|\text{spam})P(\text{spam})}{P(\text{word a, word b})}.$$

The naive Bayes assumption is that the words $a$ and $b$ appear independently, given that the message is spam (and also, given that the message is not spam). This happens to be true in this example, and the idea behind the naive Bayes classifier is to assume that it's true, in which case the formula becomes

$$\frac{P(\text{word a}|\text{spam})P(\text{word b}|\text{spam})P(\text{spam})}{P(\text{word a, word b})}.$$

Your mistake was to assume that the denominator also becomes $P(\text{word a})P(\text{word b})$ but this is not true because $a$ and $b$ are not independent; they are only independent given that you know whether the message is spam or not. You can see that they are not independent by asking: "Suppose I know that a message contains $a$. Does that tell me anything new about whether it contains $b$?" The answer is yes, certainly, because the only message which contains $a$ also contains $b$. (End of example.)
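Continuing the sketch above, you can check this numerically: the joint probability $P(\text{word a, word b})$ is $1/2$ (only one of the two messages contains both words), not $P(\text{word a})P(\text{word b}) = 1/4$, and the correct formula then gives a genuine probability:

```python
# Only one of the two messages contains both "a" and "b", so this is 1/2,
# not P(word a) * P(word b) = 1/4: the words are not independent.
p_ab = p(lambda t, s: "a" in t.split() and "b" in t.split())

right = p_spam * p_a_spam * p_b_spam / p_ab
print(right)  # 1.0 == P(spam | word a, word b)
```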

The confusion arises because people usually don't bother to write the denominator in the naive Bayes formula, as it doesn't affect the calculations except for a scaling factor which is the same for spam as for not spam. You will often see the formula written

$$P(\text{spam} | \text{word a, word b}) \propto P(\text{spam}) P(\text{word a}|\text{spam} ) P(\text{word b}|\text{spam})$$ where the constant of proportionality happens to be $\frac{1}{P(\text{word a, word b})}$. But you can ignore this constant when classifying a new message. You would simply calculate the right-hand sides

$$P(\text{spam}) P(\text{word a}|\text{spam} ) P(\text{word b}|\text{spam})$$ and $$P(\text{not spam}) P(\text{word a}|\text{not spam} ) P(\text{word b}|\text{not spam})$$ and then classify as spam or not spam depending on which of these is bigger.
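As a sketch of how that comparison looks in code (with made-up conditional probabilities and, as is standard in practice, log-scores to avoid numerical underflow):

```python
import math

def nb_score(words, p_class, p_word_given_class):
    """Unnormalized log-score: log P(class) + sum of log P(word | class)."""
    return math.log(p_class) + sum(math.log(p_word_given_class[w]) for w in words)

# Made-up estimates from some hypothetical training set.
p_word_spam = {"a": 0.8, "b": 0.7}
p_word_ham  = {"a": 0.1, "b": 0.2}

message = ["a", "b"]
spam_score = nb_score(message, 0.5, p_word_spam)
ham_score  = nb_score(message, 0.5, p_word_ham)

print("spam" if spam_score > ham_score else "not spam")  # prints "spam"
```

Since $\log$ is monotonic, comparing log-scores gives the same classification as comparing the products themselves.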