Solved – Understanding Naive Bayes

machine learning, naive bayes

From StatSoft, Inc. (2013), Electronic Statistics Textbook, "Naive Bayes Classifier":

[Figure: a scatter of objects, each labelled GREEN or RED]

To demonstrate the concept of Naïve Bayes Classification, consider the
example displayed in the illustration above. As indicated, the objects
can be classified as either GREEN or RED. My task is to classify new
cases as they arrive, i.e., decide to which class label they belong,
based on the currently existing objects.

Since there are twice as many GREEN objects as RED, it is reasonable
to believe that a new case (which hasn't been observed yet) is twice
as likely to have membership GREEN rather than RED. In the Bayesian
analysis, this belief is known as the prior probability. Prior
probabilities are based on previous experience, in this case the
percentage of GREEN and RED objects, and often used to predict
outcomes before they actually happen.

Thus, we can write:

$$P(\text{GREEN}) \propto \frac{\text{number of GREEN objects}}{\text{total number of objects}}$$

$$P(\text{RED}) \propto \frac{\text{number of RED objects}}{\text{total number of objects}}$$

Since there is a total of 60 objects, 40 of which are GREEN and 20
RED, our prior probabilities for class membership are:

$$P(\text{GREEN}) = \frac{40}{60} \qquad P(\text{RED}) = \frac{20}{60}$$

Having formulated our prior probability, we are now ready to classify
a new object (WHITE circle). Since the objects are well clustered, it
is reasonable to assume that the more GREEN (or RED) objects in the
vicinity of X, the more likely that the new case belongs to that
particular color. To measure this likelihood, we draw a circle around
X which encompasses a number (to be chosen a priori) of points
irrespective of their class labels. Then we calculate the number of
points in the circle belonging to each class label. From this we
calculate the likelihood:

$$P(X \mid \text{GREEN}) \propto \frac{\text{number of GREEN in the vicinity of } X}{\text{total number of GREEN objects}}$$

$$P(X \mid \text{RED}) \propto \frac{\text{number of RED in the vicinity of } X}{\text{total number of RED objects}}$$

From the illustration above, it is clear that Likelihood of X given
GREEN is smaller than Likelihood of X given RED, since the circle
encompasses 1 GREEN object and 3 RED ones. Thus:

$$P(X \mid \text{GREEN}) = \frac{1}{40}$$

$$P(X \mid \text{RED}) = \frac{3}{20}$$

Although the prior probabilities indicate that X may belong to GREEN
(given that there are twice as many GREEN compared to RED) the
likelihood indicates otherwise; that the class membership of X is RED
(given that there are more RED objects in the vicinity of X than
GREEN). In the Bayesian analysis, the final classification is produced
by combining both sources of information, i.e., the prior and the
likelihood, to form a posterior probability using the so-called Bayes'
rule (named after Rev. Thomas Bayes 1702-1761).

$$P(\text{GREEN} \mid X) \propto P(\text{GREEN}) \cdot P(X \mid \text{GREEN}) = \frac{40}{60} \cdot \frac{1}{40} = \frac{1}{60}$$

$$P(\text{RED} \mid X) \propto P(\text{RED}) \cdot P(X \mid \text{RED}) = \frac{20}{60} \cdot \frac{3}{20} = \frac{1}{20}$$

Finally, we classify X as RED since its class membership achieves the
largest posterior probability.

This is where the difficulty of my maths understanding comes in.

$$p(C_j \mid x_1, x_2, \ldots, x_d) = \frac{p(x_1, x_2, \ldots, x_d \mid C_j)\, p(C_j)}{p(x_1, x_2, \ldots, x_d)}$$

$p(C_j \mid x_1, x_2, \ldots, x_d)$ is the posterior probability of class membership, i.e., the probability that X belongs to $C_j$. But why write it like this?

Calculating the likelihood?

$$p(x_1, x_2, \ldots, x_d \mid C_j) = \prod_{k=1}^{d} p(x_k \mid C_j)$$

Posterior Probability?

$$p(C_j \mid x_1, x_2, \ldots, x_d) \propto p(C_j) \prod_{k=1}^{d} p(x_k \mid C_j)$$

I never took much math, but I think my understanding of naive Bayes itself is fine; it's these decomposed formulas that confuse me. Could someone help with visualizing these steps and show how to write the math out in an understandable way?

Best Answer

I'm going to run through the whole Naive Bayes process from scratch, since it's not totally clear to me where you're getting hung up.

We want to find the probability that a new example belongs to each class: $P(class|feature_1, feature_2, ..., feature_n)$. We then compute that probability for each class, and pick the most likely class. The problem is that we usually don't have those probabilities. However, Bayes' Theorem lets us rewrite that equation in a more tractable form.

Bayes' Theorem is simply $$P(A|B)=\frac{P(B|A) \cdot P(A)}{P(B)}$$ or, in terms of our problem: $$P(class|features)=\frac{P(features|class) \cdot P(class)}{P(features)}$$

We can simplify this by removing $P(features)$. We can do this because we're going to rank $P(class|features)$ for each value of $class$; $P(features)$ will be the same every time--it doesn't depend on $class$. This leaves us with $$ P(class|features) \propto P(features|class) \cdot P(class)$$
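
If it helps to see why dropping $P(features)$ is safe, here's a tiny Python sketch (the two scores are invented purely for illustration): dividing every class's score by the same shared constant rescales them, but can't change which one is largest.

```python
# Tiny illustration: dividing every class score by the same constant
# (the evidence P(features)) rescales the scores but never changes
# which class ranks highest. The two scores are made up for illustration.

unnormalized = {
    "green": (2 / 3) * 0.05,  # P(class) * P(features | class)
    "red":   (1 / 3) * 0.30,
}

evidence = sum(unnormalized.values())            # P(features), by total probability
posterior = {c: s / evidence for c, s in unnormalized.items()}

print(max(unnormalized, key=unnormalized.get))   # red
print(max(posterior, key=posterior.get))         # red -- same winner either way
```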

The prior probabilities, $P(class)$, can be calculated as you described in your question.

That leaves $P(features|class)$. We want to avoid estimating the massive, and probably very sparse, joint probability $P(feature_1, feature_2, ..., feature_n|class)$. If each feature is conditionally independent of the others given the class, then $$P(feature_1, feature_2, ..., feature_n|class) = \prod_i{P(feature_i|class)}$$ Even if they're not actually independent, we can assume they are (that's the "naive" part of naive Bayes). I personally think it's easier to think this through for discrete (i.e., categorical) variables, so let's use a slightly different version of your example. Here, I've divided each feature dimension into two categorical variables.

[Figure: the example data discretized, with $feature_1$ split into regions A and B and $feature_2$ into regions X and Y]
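
As an aside on why that independence assumption buys so much: with $n$ binary features, the full joint $P(feature_1, ..., feature_n|class)$ has on the order of $2^n$ entries per class to estimate, while the factored version needs only $n$. A quick back-of-the-envelope comparison:

```python
# Rough parameter count per class with n binary features:
# full joint conditional vs. the naive (factored) model. Purely illustrative.
for n in (2, 10, 20):
    full_joint = 2 ** n   # one probability per combination of feature values
    naive = n             # one P(feature_i | class) per feature
    print(f"n={n:2d}  joint entries={full_joint:>9,}  naive entries={naive}")
```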

Example: Training the classifier

To train the classifier, we count up various subsets of points and use them to compute the prior and conditional probabilities.

The priors are trivial: there are sixty points in total, of which forty are green and twenty are red. Thus $$P(class=green)=\frac{40}{60} = 2/3 \text{ and } P(class=red)=\frac{20}{60}=1/3$$

Next, we have to compute the conditional probabilities of each feature-value given a class. Here, there are two features: $feature_1$ and $feature_2$, each of which takes one of two values (A or B for one, X or Y for the other). We therefore need to know the following:

  • $P(feature_1=A|class=red)$
  • $P(feature_1=B|class=red)$
  • $P(feature_1=A|class=green)$
  • $P(feature_1=B|class=green)$
  • $P(feature_2=X|class=red)$
  • $P(feature_2=Y|class=red)$
  • $P(feature_2=X|class=green)$
  • $P(feature_2=Y|class=green)$
  • (in case it's not obvious, this is all possible pairs of feature-value and class)

These are also easy to compute by counting and dividing. For example, for $P(feature_1=A|class=red)$, we look only at the red points and count how many of them are in the 'A' region for $feature_1$. There are twenty red points, all of which are in the 'A' region, so $P(feature_1=A|class=red)=20/20=1$. None of the red points are in the 'B' region, so $P(feature_1=B|class=red)=0/20=0$. Next, we do the same, but consider only the green points. This gives us $P(feature_1=A|class=green)=5/40=1/8$ and $P(feature_1=B|class=green)=35/40=7/8$. We repeat that process for $feature_2$ to round out the probability table. Assuming I've counted correctly, we get

  • $P(feature_1=A|class=red)=1$
  • $P(feature_1=B|class=red)=0$
  • $P(feature_1=A|class=green)=1/8$
  • $P(feature_1=B|class=green)=7/8$
  • $P(feature_2=X|class=red)=3/10$
  • $P(feature_2=Y|class=red)=7/10$
  • $P(feature_2=X|class=green)=8/10$
  • $P(feature_2=Y|class=green)=2/10$

Those ten probabilities (the two priors plus the eight conditionals) are our model.
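
If you'd like to see the counting spelled out as code, here's a minimal Python sketch. I haven't reproduced the raw points; the per-class counts below are the ones implied by the table, and the code just divides them out into the priors and conditionals:

```python
# Build the ten-number model by counting and dividing.
# The counts are the ones implied by the table in this answer; the code
# only turns them into the prior and conditional probability tables.

class_counts = {"green": 40, "red": 20}
total = sum(class_counts.values())

# counts[class][feature][value] = number of points of that class
# falling in that region of that feature
counts = {
    "red":   {"feature_1": {"A": 20, "B": 0},  "feature_2": {"X": 6,  "Y": 14}},
    "green": {"feature_1": {"A": 5,  "B": 35}, "feature_2": {"X": 32, "Y": 8}},
}

priors = {c: n / total for c, n in class_counts.items()}

conditionals = {
    c: {f: {v: n / class_counts[c] for v, n in values.items()}
        for f, values in features.items()}
    for c, features in counts.items()
}

print(priors)                                  # {'green': 0.666..., 'red': 0.333...}
print(conditionals["red"]["feature_2"]["Y"])   # 0.7
```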

Classifying a New Example

Let's classify the white point from your example. It's in the "A" region for $feature_1$ and the "Y" region for $feature_2$. We want to find the probability that it's in each class. Let's start with red. Using the formula above, we know that: $$P(class=red|example) \propto P(class=red) \cdot P(feature_1=A|class=red) \cdot P(feature_2=Y|class=red)$$ Subbing in the probabilities from the table, we get

$$P(class=red|example) \propto \frac{1}{3} \cdot 1 \cdot \frac{7}{10} = \frac{7}{30}$$ We then do the same for green: $$P(class=green|example) \propto P(class=green) \cdot P(feature_1=A|class=green) \cdot P(feature_2=Y|class=green) $$

Subbing in those values gets us $\frac{2}{3} \cdot \frac{1}{8} \cdot \frac{2}{10} = \frac{1}{60}$. Finally, we look to see which class gave us the highest probability. In this case, it's clearly the red class ($7/30 > 1/60$), so that's where we assign the point.
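
Here's the same arithmetic as a short Python sketch, with the ten model probabilities written out by hand (values copied from the table above):

```python
# Score the white point (feature_1 = "A", feature_2 = "Y") for each class
# by multiplying the prior with the per-feature conditionals, then take
# the class with the largest score.

priors = {"green": 2 / 3, "red": 1 / 3}
conditionals = {
    "red":   {"feature_1": {"A": 1.0,   "B": 0.0},   "feature_2": {"X": 0.3, "Y": 0.7}},
    "green": {"feature_1": {"A": 1 / 8, "B": 7 / 8}, "feature_2": {"X": 0.8, "Y": 0.2}},
}

example = {"feature_1": "A", "feature_2": "Y"}

scores = {}
for c in priors:
    score = priors[c]
    for feature, value in example.items():
        score *= conditionals[c][feature][value]
    scores[c] = score

print(scores)                        # {'green': 0.0166..., 'red': 0.2333...}
print(max(scores, key=scores.get))   # red
```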

Notes

In your original example, the features are continuous. In that case, you need some way of assigning $P(feature=value|class)$ for each class. One option is to fit each feature to a known probability distribution (e.g., a Gaussian). During training, you would find the mean and variance for each class along each feature dimension. To classify a point, you'd compute $P(feature=value|class)$ by plugging the value into the density with the appropriate mean and variance for each class. Other distributions might be more appropriate, depending on the particulars of your data, but a Gaussian is a decent starting point.
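
Here's a rough sketch of that Gaussian variant, assuming one normal distribution per class per feature dimension; the training points below are placeholders rather than your actual data:

```python
import math

# Gaussian naive Bayes sketch: fit a mean and variance per class per feature,
# then score new points with the normal density in place of the counted
# conditionals. The training arrays are placeholders.

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(points_by_class):
    """points_by_class: {class: [(f1, f2), ...]} -> priors and per-feature stats."""
    model = {}
    for c, points in points_by_class.items():
        stats = []
        for dim in zip(*points):                 # one feature dimension at a time
            mean = sum(dim) / len(dim)
            var = sum((x - mean) ** 2 for x in dim) / len(dim)
            stats.append((mean, var))
        model[c] = {"prior": len(points), "stats": stats}
    total = sum(m["prior"] for m in model.values())
    for m in model.values():
        m["prior"] /= total
    return model

def classify(model, point):
    scores = {}
    for c, m in model.items():
        score = m["prior"]
        for x, (mean, var) in zip(point, m["stats"]):
            score *= gaussian_pdf(x, mean, var)
        scores[c] = score
    return max(scores, key=scores.get)

# Placeholder training data: two continuous features per point.
data = {
    "green": [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (1.1, 2.1)],
    "red":   [(3.0, 0.5), (3.2, 0.7), (2.9, 0.4), (3.1, 0.6)],
}
model = fit(data)
print(classify(model, (3.0, 0.6)))   # red
```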

I'm not too familiar with the DARPA data set, but you'd do essentially the same thing. You'll probably end up computing something like P(service=finger|attack=TRUE), P(service=finger|attack=FALSE), P(service=ftp|attack=TRUE), etc., and then combining them in the same way as the example. As a side note, part of the trick here is to come up with good features. Source IP, for example, is probably going to be hopelessly sparse: you'll likely only have one or two examples for a given IP. You might do much better if you geolocated the IP and used "Source_in_same_building_as_dest (true/false)" or something like that as a feature instead.

I hope that helps more. If anything needs clarification, I'd be happy to try again!