Overfitting – How Naive Bayes Can Overfit the Data

laplace-smoothing · naive-bayes · overfitting

I know that Laplace smoothing results in high bias in Naive Bayes. If the smoothing parameter (alpha) is large, the conditional likelihoods of the features approach a uniform distribution. Since Naive Bayes classifies based on those conditional likelihoods, the model will then effectively predict from the class priors alone, resulting in high bias.
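A quick sketch of the claim above (my own illustration, not part of the original question), using scikit-learn's `MultinomialNB` on synthetic count data: as alpha grows very large, the per-class feature log-likelihoods flatten toward uniform.

```python
# Illustration: heavy Laplace smoothing flattens the learned feature
# likelihoods of MultinomialNB toward a uniform distribution.
# The data here is synthetic and arbitrary.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(200, 5))   # random count features
y = rng.integers(0, 2, size=200)         # two classes

small = MultinomialNB(alpha=1.0).fit(X, y)
huge = MultinomialNB(alpha=1e6).fit(X, y)

# Spread (max minus min) of the per-class feature log-probabilities:
# near zero means the likelihoods are nearly uniform across features.
spread_small = np.ptp(small.feature_log_prob_, axis=1).max()
spread_huge = np.ptp(huge.feature_log_prob_, axis=1).max()
print(spread_small, spread_huge)  # spread_huge is close to 0
```

With the likelihoods flattened, the decision reduces to comparing class priors, which is exactly the high-bias behavior described above.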
Now, my question is: can a Naive Bayes model overfit the data? If so, how?

Best Answer

In general, overfitting is not something you should worry about much with naive Bayes; it is more likely to underfit. Naive Bayes is a fairly simple algorithm that makes a strong assumption of independence between the features, so it is biased and less flexible, and hence less likely to overfit. But overfitting is possible. For example, say that in your training set you have a feature (or several features) with very high (or low) conditional probabilities that strongly influence the predictions, and those probabilities differ from what you would see outside the training data. The model would be accurate at training time but not at prediction time, hence it would overfit.
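The failure mode described above can be sketched concretely (my own toy example, assuming scikit-learn's `BernoulliNB`): one feature is genuinely informative, while a second, spurious feature happens to match the label perfectly in the training sample but is pure noise at test time. With almost no smoothing, naive Bayes memorizes the spurious feature's extreme likelihood and test accuracy collapses toward chance.

```python
# Toy demonstration of naive Bayes overfitting: a spurious binary feature
# is perfectly predictive in the training sample but random at test time.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
n_train, n_test = 200, 2000

y_tr = rng.integers(0, 2, n_train)
y_te = rng.integers(0, 2, n_test)

# Genuinely informative feature: agrees with the label 70% of the time,
# in both the training and test distributions.
good_tr = np.where(rng.random(n_train) < 0.7, y_tr, 1 - y_tr)
good_te = np.where(rng.random(n_test) < 0.7, y_te, 1 - y_te)

# Spurious feature: identical to the label in training, random at test time.
spur_tr = y_tr.copy()
spur_te = rng.integers(0, 2, n_test)

X_tr = np.column_stack([good_tr, spur_tr])
X_te = np.column_stack([good_te, spur_te])

# Near-zero smoothing lets the model assign an extreme likelihood
# to the spurious feature.
clf = BernoulliNB(alpha=1e-10).fit(X_tr, y_tr)
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
print(train_acc, test_acc)  # training accuracy near 1.0, test near chance
```

A larger alpha (e.g. the default `alpha=1.0`) tempers the spurious likelihood and narrows the train/test gap, which is the bias-variance trade-off the question describes from the other direction.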