Solved – When to use Bernoulli Naive Bayes

machine learningnaive bayespython

Below is an example of a dummy scatter plot of x,y where BLUE (0) and RED(1) are the 'target'. The yellow dot is my input and I'm asking what is the prediction that this is either BLUE (0) or RED(1)

By using Gaussian Naive Bayes (GaussianNB), I get a prediction of 0 with a probability of 99.99% (make sense)

Now, when I use Bernoulli Naive Bayes (BernoulliNB), I get a prediction of RED(1) with a probability of 0.4202. (BTW, Multinominal NB is also off: 0.57%)

Questions:

When will one use Bernoulli Naive Bayes (appreciate an example) and
Why in this instance Bernoulli's prediction is so off?

Best Answer

Bernoulli Naive Bayes is for binary features only. Similarly, multinomial naive Bayes treats features as event probabilities. Your example is given for nonbinary real-valued features $(x,y)$, which do not exclusively lie in the interval $[0,1]$, so the models do not apply to your features.

A typical example (taken from the wiki page) for either Bernoulli or multinomial NB is document classification, where the features represent the presence of a term (in the Bernoulli case) or the probability of a term (in the multinomial case).

Related Solutions

Solved – Understanding Naive Bayes

I'm going to run through the whole Naive Bayes process from scratch, since it's not totally clear to me where you're getting hung up.

We want to find the probability that a new example belongs to each class: $P(class|feature_1, feature_2,..., feature_n$). We then compute that probability for each class, and pick the most likely class. The problem is that we usually don't have those probabilities. However, Bayes' Theorem lets us rewrite that equation in a more tractable form.

Bayes' Thereom is simply$$P(A|B)=\frac{P(B|A) \cdot P(A)}{P(B)}$$ or in terms of our problem: $$P(class|features)=\frac{P(features|class) \cdot P(class)}{P(features)}$$

We can simplify this by removing $P(features)$. We can do this because we're going to rank $P(class|features)$ for each value of $class$; $P(features)$ will be the same every time--it doesn't depend on $class$. This leaves us with $$ P(class|features) \propto P(features|class) \cdot P(class)$$

The prior probabilities, $P(class)$, can be calculated as you described in your question.

That leaves $P(features|class)$. We want to eliminate the massive, and probably very sparse, joint probability $P(feature_1, feature_2, ..., feature_n|class)$. If each feature is independent , then $$P(feature_1, feature_2, ..., feature_n|class) = \prod_i{P(feature_i|class})$$ Even if they're not actually independent, we can assume they are (that's the "naive" part of naive Bayes). I personally think it's easier to think this through for discrete (i.e., categorical) variables, so let's use a slightly different version of your example. Here, I've divided each feature dimension into two categorical variables.

Discrete Example Data .

Example: Training the classifer

To train the classifer, we count up various subsets of points and use them to compute the prior and conditional probabilities.

The priors are trivial: There are sixty total points, forty are green while twenty are red. Thus $$P(class=green)=\frac{40}{60} = 2/3 \text{ and } P(class=red)=\frac{20}{60}=1/3$$

Next, we have to compute the conditional probabilities of each feature-value given a class. Here, there are two features: $feature_1$ and $feature_2$, each of which takes one of two values (A or B for one, X or Y for the other). We therefore need to know the following:

$P(feature_1=A|class=red)$
$P(feature_1=B|class=red)$
$P(feature_1=A|class=green)$
$P(feature_1=B|class=green)$
$P(feature_2=X|class=red)$
$P(feature_2=Y|class=red)$
$P(feature_2=X|class=green)$
$P(feature_2=Y|class=green)$
(in case it's not obvious, this is all possible pairs of feature-value and class)

These are easy to compute by counting and dividing too. For example, for $P(feature_1=A|class=red)$, we look only at the red points and count how many of them are in the 'A' region for $feature_1$. There are twenty red points, all of which are in the 'A' region, so $P(feature_1=A|class=red)=20/20=1$. None of the red points are in the B region, so $P(feature_1|class=red)=0/20=0$. Next, we do the same, but consider only the green points. This gives us $P(feature_1=A|class=green)=5/40=1/8$ and $P(feature_1=B|class=green)=35/40=7/8$. We repeat that process for $feature_2$, to round out the probability table. Assuming I've counted correctly, we get

$P(feature_1=A|class=red)=1$
$P(feature_1=B|class=red)=0$
$P(feature_1=A|class=green)=1/8$
$P(feature_1=B|class=green)=7/8$
$P(feature_2=X|class=red)=3/10$
$P(feature_2=Y|class=red)=7/10$
$P(feature_2=X|class=green)=8/10$
$P(feature_2=Y|class=green)=2/10$

Those ten probabilities (the two priors plus the eight conditionals) are our model

Classifying a New Example

Let's classify the white point from your example. It's in the "A" region for $feature_1$ and the "Y" region for $feature_2$. We want to find the probability that it's in each class. Let's start with red. Using the formula above, we know that: $$P(class=red|example) \propto P(class=red) \cdot P(feature_1=A|class=red) \cdot P(feature_2=Y|class=red)$$ Subbing in the probabilities from the table, we get

$$P(class=red|example) \propto \frac{1}{3} \cdot 1 \cdot \frac{7}{10} = \frac{7}{30}$$ We then do the same for green: $$P(class=green|example) \propto P(class=green) \cdot P(feature_1=A|class=green) \cdot P(feature_2=Y|class=green) $$

Subbing in those values gets us 0 ($2/3 \cdot 0 \cdot 2/10$). Finally, we look to see which class gave us the highest probability. In this case, it's clearly the red class, so that's where we assign the point.

Notes

In your original example, the features are continuous. In that case, you need to find some way of assigning P(feature=value|class) for each class. You might consider fitting then to a known probability distribution (e.g., a Gaussian). During training, you would find the mean and variance for each class along each feature dimension. To classify a point, you'd find $P(feature=value|class)$ by plugging in the appropriate mean and variance for each class. Other distributions might be more appropriate, depending on the particulars of your data, but a Gaussian would be a decent starting point.

I'm not too familiar with the DARPA data set, but you'd do essentially the same thing. You'll probably end up computing something like P(attack=TRUE|service=finger), P(attack=false|service=finger), P(attack=TRUE|service=ftp), etc. and then combine them in the same way as the example. As a side note, part of the trick here is to come up with good features. Source IP , for example, is probably going to be hopelessly sparse--you'll probably only have one or two examples for a given IP. You might do much better if you geolocated the IP and use "Source_in_same_building_as_dest (true/false)" or something as a feature instead.

I hope that helps more. If anything needs clarification, I'd be happy to try again!

Solved – Naive Bayes Python implementation differences

Scikit learn has several Naive bayes. Naive bayes is usually a quick and dirty way to do classification.

The different ones used are:

Gaussian Naive Bayes: which normally used

Bernoulli Naive Bayes: used for things with 2 variables (heads or tails, yes or no)

Multinomial Naive Bayes: Usually used for text processing, where you have a smoothing parameter for missing data.

Here is a good explaination of the Naive bayes with +1 smoothing https://www.youtube.com/watch?v=0hxaqDbdIeE

Not sure about NLTK but sklearn is optomized based on this paper:http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf

Also sklearn's algorithms are further optimized using cython.