In Naive Bayes, why bother with Laplace smoothing when we have unknown words in the test set?

classification, laplace-smoothing, machine-learning, naive-bayes, text-mining

I was reading about Naive Bayes classification today. Under the heading of "Parameter Estimation with add-1 smoothing", I read:

Let $c$ refer to a class (such as Positive or Negative), and let $w$ refer to a token or word.

The maximum likelihood estimator for $P(w|c)$ is $$\frac{\text{count}(w,c)}{\text{count}(c)} = \frac{\text{count of } w \text{ in class } c}{\text{count of words in class } c}.$$

This estimate of $P(w|c)$ can be problematic: it assigns probability $0$ to any word that never appears in the training data for class $c$, and hence probability $0$ to any document containing such a word. A common way of solving this problem is to use Laplace smoothing.

Let $V$ be the set of words in the training set, and add a new element $UNK$ (for unknown) to it.

Define $$P(w|c)=\frac{\text{count}(w,c) +1}{\text{count}(c) + |V| + 1},$$

where $V$ refers to the vocabulary (the words in the training set).

In particular, any unknown word will have probability $$\frac{1}{\text{count}(c) + |V| + 1}.$$
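For concreteness, here is a minimal Python sketch of the two estimators above; the class names, training tokens, and counts are made up purely for illustration.

```python
from collections import Counter

# Toy training data: tokens observed in each class (made-up counts).
train = {
    "Positive": "good great good fun".split(),
    "Negative": "bad boring bad".split(),
}

vocab = {w for tokens in train.values() for w in tokens}        # V, here |V| = 5
counts = {c: Counter(tokens) for c, tokens in train.items()}    # count(w, c)

def p_mle(w, c):
    """Maximum likelihood estimate: count(w, c) / count(c)."""
    return counts[c][w] / len(train[c])

def p_laplace(w, c):
    """Add-1 (Laplace) estimate: (count(w, c) + 1) / (count(c) + |V| + 1)."""
    return (counts[c][w] + 1) / (len(train[c]) + len(vocab) + 1)

print(p_mle("good", "Positive"))        # 2/4 = 0.5
print(p_mle("unseen", "Positive"))      # 0.0 -- the problem described above
print(p_laplace("unseen", "Positive"))  # 1/(4 + 5 + 1) = 0.1
```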

My question is this: why do we bother with Laplace smoothing at all? If these unknown words that we encounter in the test set have a probability that is obviously almost zero, i.e., $\frac{1}{\text{count}(c) + |V| + 1}$, what is the point of including them in the model? Why not just disregard and delete them?

Best Answer

Let's say you've trained your Naive Bayes classifier on 2 classes, "Ham" and "Spam" (i.e. it classifies emails). For the sake of simplicity, we'll assume the prior probabilities are 50/50.

Now let's say you have an email $(w_1, w_2, \dots, w_n)$ which your classifier rates very highly as "Ham", say $$P(Ham|w_1, w_2, \dots, w_n) = 0.90$$ and $$P(Spam|w_1, w_2, \dots, w_n) = 0.10.$$

So far so good.

Now let's say you have another email $(w_1, w_2, \dots, w_n, w_{n+1})$ which is exactly the same as the email above, except that it contains one word, $w_{n+1}$, that is not in the vocabulary. Since this word's count is $0$, the maximum likelihood estimate gives $$P(w_{n+1}|Ham) = P(w_{n+1}|Spam) = 0.$$

Suddenly, $$P(Ham|w_1, w_2, \dots, w_n, w_{n+1}) \propto P(Ham)\prod_{i=1}^{n+1} P(w_i|Ham) = 0$$ and $$P(Spam|w_1, w_2, \dots, w_n, w_{n+1}) \propto P(Spam)\prod_{i=1}^{n+1} P(w_i|Spam) = 0.$$

Despite the first email being strongly classified as Ham, the second email's scores for both classes collapse to zero because of that single unseen word, so the classifier can no longer separate the classes.

Laplace smoothing solves this by giving the last word a small non-zero probability for both classes, so that the posterior probabilities don't suddenly drop to zero.
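To make this concrete, here is a small Python sketch of the Ham/Spam scenario; all per-word likelihoods, counts, and the vocabulary size are hypothetical numbers chosen only to show how the unsmoothed scores collapse to zero while the smoothed ones keep the original ordering.

```python
def score(prior, likelihoods):
    """Unnormalized posterior: prior * product of per-word likelihoods P(w_i|class)."""
    s = prior
    for p in likelihoods:
        s *= p
    return s

# Hypothetical per-word likelihoods for the n known words of a Ham-leaning email.
ham_known  = [0.09, 0.08, 0.07]
spam_known = [0.03, 0.02, 0.04]

# MLE: the unseen word w_{n+1} has likelihood 0 in both classes,
# so both scores collapse and the classes become indistinguishable.
print(score(0.5, ham_known + [0.0]))    # 0.0
print(score(0.5, spam_known + [0.0]))   # 0.0

# Laplace smoothing: the unseen word gets 1/(count(c) + |V| + 1) in each class
# (hypothetical numbers: count(c) = 99 and |V| = 900 for both classes).
unk = 1 / (99 + 900 + 1)
print(score(0.5, ham_known + [unk]))    # ~2.5e-07
print(score(0.5, spam_known + [unk]))   # ~1.2e-08 -- Ham still wins
```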
