Solved – How to handle unseen features in a Naive Bayes classifier

laplace-smoothing · naive-bayes · natural-language · probability

I am writing a naive Bayes classifier for a text classification problem. I have a bunch of word lists, each with an associated label:

[short,snippet,text], label1
[slightly,different,snippet,text], label2
...

I am able to train the naive Bayes model fine. However, when I am classifying unseen data, sometimes there are unseen features (words). In that case, what happens to the naive Bayes formula for the probability of a class $C$ given features $F_1,F_2,\dots$?

$$P(C|F_1,F_2,…) = \frac{P(F_1,F_2,…|C)P(C)}{P(F_1,F_2,…)} = \frac{P(C)\prod_{i}P(F_i|C)}{P(F_1,F_2,…)}$$

Say feature $F_k$ never occurred in the training data; then isn't $P(F_k|C)=0$, which makes the entire product zero?

How is this typically handled in classification problems?

One option is to simply ignore unseen features. However, I would not like to do that, since I am trying to calculate the actual probability score associated with classes. Probabilities should take a hit when there are unseen features, but I am not sure how to do that mathematically.
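To make the problem concrete, here is a minimal sketch with made-up maximum-likelihood estimates (the word `brandnew` and the probabilities are hypothetical) showing how a single unseen word zeroes out the unsmoothed likelihood product:

```python
# Made-up MLE estimates P(word | label1) for illustration;
# "brandnew" never occurred with label1, so its MLE estimate is 0.
p = {"snippet": 2 / 3, "text": 1 / 3, "brandnew": 0.0}

score = 1.0
for w in ["snippet", "text", "brandnew"]:
    score *= p[w]
# The single zero wipes out the whole product, no matter the other factors.
```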

Any insights, links to reseach articles, etc would really help! Thanks in advance.

Best Answer

Typically one would use Laplace smoothing: essentially, add one artificial observation of every feature to every class. This prevents a feature that was never observed in some class from producing a zero that propagates through the whole product. It is equivalent to placing a uniform prior over features.
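As a sketch, using a toy version of the corpus from the question (the documents, labels, and helper name are made up), add-one smoothing looks like this:

```python
from collections import Counter

# A hypothetical toy corpus in the shape described in the question.
docs = [
    (["short", "snippet", "text"], "label1"),
    (["slightly", "different", "snippet", "text"], "label2"),
]

# Per-class word counts and the overall vocabulary.
counts = {}
for words, label in docs:
    counts.setdefault(label, Counter()).update(words)
vocab = {w for words, _ in docs for w in words}

def p_word_given_class(word, label, alpha=1.0):
    # Add-one (Laplace) smoothing:
    # (count(word, class) + alpha) / (total(class) + alpha * |V|)
    c = counts[label]
    return (c[word] + alpha) / (sum(c.values()) + alpha * len(vocab))

# Even a word never seen with label1 now gets a small nonzero probability.
```

With $\alpha=1$ this is classic add-one smoothing; smaller $\alpha$ (Lidstone smoothing) shrinks the artificial counts.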

For a feature never seen in the training data for any class, the "uniform prior" means the smoothed estimate is essentially the same for every class (hence uniform in the absence of data), so it has little to no impact on which class you select.

In terms of the decision your classifier makes, this has essentially the same result as just throwing away the novel feature! So that is what you should do. Technically, keeping it would change the probability slightly, but naive Bayes doesn't give well-calibrated probabilities in the first place, so it's not worth worrying about.
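A quick sketch of that claim, reusing the same hypothetical toy corpus with a uniform class prior (the query word `zzz` is invented and appears nowhere in training):

```python
import math
from collections import Counter

# Same hypothetical toy corpus as in the question; uniform class prior.
docs = [
    (["short", "snippet", "text"], "label1"),
    (["slightly", "different", "snippet", "text"], "label2"),
]
counts = {}
for words, label in docs:
    counts.setdefault(label, Counter()).update(words)
vocab = {w for words, _ in docs for w in words}

def log_score(words, label):
    # log P(C) + sum of log P(w | C), with add-one smoothing.
    c = counts[label]
    total = sum(c.values())
    s = math.log(1 / len(counts))  # uniform prior over classes
    for w in words:
        s += math.log((c[w] + 1) / (total + len(vocab)))
    return s

def predict(words):
    return max(counts, key=lambda lab: log_score(words, lab))

# "zzz" never appears in training; keeping it or dropping it shifts the
# scores only slightly, so the argmax decision comes out the same here.
```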

However, I would not like to do that, since I am trying to calculate the actual probability score associated with classes. Probabilities should take a hit when there are unseen features, but I am not sure how to do that mathematically.

This is a good intuition, and it is correct. But in general we can't do much when we encounter unobserved features, since we intrinsically have no knowledge about them. All you can really do is pick a prior belief and fall back on it when you have no data.

If you truly want good probabilities, look into logistic regression. It's not perfect either, but its probabilities are much better calibrated than what naive Bayes will give you.
