Solved – Greater than 1 Naive Bayes Probabilities

conditional-probability, naive-bayes, probability

I am trying to train a Naive Bayes classifier. In addition to the most likely class, I would also like it to output the probability associated with each label.

I am making two assumptions: 1) conditional independence of the features given the class label, and 2) independence of the features. However, the math does not seem to work out (I get probabilities greater than 1 for certain labels).

Let's assume we are dealing with two features ($F_1$ and $F_2$). This is the probability I want to compute:

$$P(C|F_1,F_2)$$

Where $C$ is the class. By Bayes' rule:

$$P(C|F_1,F_2) = \frac{P(F_1,F_2|C)P(C)}{P(F_1,F_2)}$$

Using the independence assumptions above:

$$P(C|F_1,F_2) = \frac{P(F_1|C)P(F_2|C)P(C)}{P(F_1)P(F_2)}$$

Now, let's say we train the Naive Bayes classifier on the following data:

[Image: training data table]

Now suppose we want to classify a new observation with $F_1=1$ and $F_2=1$.

So let's first compute $P(C=A|F_1=1,F_2=1)$:

$$P(C=A|F_1=1,F_2=1)=\frac{P(F_1=1|C=A)P(F_2=1|C=A)P(C=A)}{P(F_1=1)P(F_2=1)}=\frac{1\cdot 1\cdot\frac{1}{2}}{\frac{1}{2}\cdot\frac{1}{2}}=2$$
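For concreteness, here is a minimal Python sketch that reproduces the arithmetic above using only the probabilities already stated (the variable names are mine, not from any particular implementation):

```python
# Reproduce the calculation above: the denominator treats F1 and F2
# as if they were marginally independent.
p_f1_given_a = 1.0   # P(F1=1 | C=A)
p_f2_given_a = 1.0   # P(F2=1 | C=A)
p_a = 0.5            # P(C=A)
p_f1 = 0.5           # P(F1=1)
p_f2 = 0.5           # P(F2=1)

numerator = p_f1_given_a * p_f2_given_a * p_a
denominator = p_f1 * p_f2            # assumes P(F1, F2) = P(F1) * P(F2)
print(numerator / denominator)       # 2.0 -- not a valid probability
```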

Clearly, I have gone wrong somewhere. However, I can't pinpoint it. Any insights would be highly appreciated!

Best Answer

$F_{1}$ and $F_{2}$ are independent given $C$, but that does not make them marginally independent. So the problem is in the denominator: you cannot factor $P(F_{1},F_{2})$ as $P(F_{1})P(F_{2})$. Recall that $P(F_{1},F_{2}) = \sum_{C}P(F_{1},F_{2},C) = \sum_{C}P(F_{1}|C)P(F_{2}|C)P(C)$.
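Here is a minimal sketch of the corrected computation. Since the training table is not shown, the class-$B$ likelihoods below are inferred from the numbers in the question (assuming only two classes, $P(F_1=1)=1\cdot\frac{1}{2}+P(F_1=1|C=B)\cdot\frac{1}{2}=\frac{1}{2}$ forces $P(F_1=1|C=B)=0$, and likewise for $F_2$); treat them as an assumption:

```python
# Corrected calculation: obtain P(F1, F2) by marginalizing over C,
# P(F1, F2) = sum_C P(F1 | C) * P(F2 | C) * P(C).
likelihoods = {
    # class: (P(F1=1 | C), P(F2=1 | C), P(C))
    "A": (1.0, 1.0, 0.5),
    "B": (0.0, 0.0, 0.5),  # assumed values, implied by P(F1=1) = P(F2=1) = 1/2
}

joint = {c: p_f1 * p_f2 * p_c for c, (p_f1, p_f2, p_c) in likelihoods.items()}
evidence = sum(joint.values())        # P(F1=1, F2=1) = 0.5, not 0.25

posteriors = {c: j / evidence for c, j in joint.items()}
print(posteriors)                     # {'A': 1.0, 'B': 0.0} -- valid probabilities
```

With the marginalized denominator the posteriors sum to 1, which is why most implementations normalize the per-class joint scores rather than ever computing $P(F_1)P(F_2)$.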