Solved – Bernoulli NB vs Multinomial NB: how to choose among different NB algorithms

bernoulli-distribution · machine-learning · multinomial-distribution · naive-bayes

I want to understand the logic behind choosing a specific type of Naive Bayes (NB) algorithm for a particular dataset. I have read about Naive Bayes, but a few things are still unclear.

According to my understanding of the NB algorithms:

1. Gaussian NB: should be used for continuous (real-valued) features. GNB assumes the features follow a normal distribution.

2. Multinomial NB: should be used for features with discrete counts, like word counts 1, 2, 3, …

3. Bernoulli NB: should be used for features with binary or Boolean values, like True/False or 0/1.

Am I correct up to this point? If not, please correct me.
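The three-way split above can be sketched with scikit-learn's corresponding classes. This is a minimal illustration with made-up data, not the phishing dataset:

```python
# Sketch: matching scikit-learn NB variants to feature types (toy data).
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 1, 0, 1])

# Continuous (real-valued) features -> GaussianNB
X_cont = np.array([[1.2, 0.7], [2.3, 1.9], [0.9, 0.4], [2.8, 2.2]])
gnb = GaussianNB().fit(X_cont, y)

# Non-negative discrete counts (e.g. word counts) -> MultinomialNB
X_counts = np.array([[3, 0], [0, 2], [4, 1], [1, 5]])
mnb = MultinomialNB().fit(X_counts, y)

# Binary indicator features -> BernoulliNB
X_bin = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
bnb = BernoulliNB().fit(X_bin, y)
```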

I was working on the Phishing dataset from the UCI repository. Some of its features have values {1, 0, -1} and some have {1, 0}. It looks like discrete data, so I tried MultinomialNB, but it raised an error because of the negative (-1) values in the features. To solve this, I added +1 to the whole dataframe:

phising_df = phising_df + 1

Now the range of values for the features changed to {0, 1, 2}.

After doing this, it worked and gave 85% accuracy.
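The shift can be reproduced with a small NumPy sketch (the actual dataset is a pandas dataframe; the values here are made up for illustration):

```python
# Sketch of the +1 shift described above, using toy {-1, 0, 1} features.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[-1, 0], [1, -1], [0, 1], [1, 1]])  # values in {-1, 0, 1}
y = np.array([0, 1, 0, 1])

# MultinomialNB rejects negative feature values, so shift everything by +1.
X_shifted = X + 1  # values now in {0, 1, 2}
mnb = MultinomialNB().fit(X_shifted, y)
```

Note that after the shift the values are no longer genuine counts, so the multinomial likelihood is only being applied by analogy.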

Just to check whether Bernoulli NB would work on this data or not, I ran BernoulliNB. This time I got 90%+ accuracy.

The academic literature says Bernoulli NB works only with binary data, so how did it work with data that has three values (0, 1, 2)?
Please let me know whether I am following the right approach. Also, please explain the differences among the various types of NB algorithms.

PS: my ipynb notebook is available at this link

Best Answer

The variant of Naive Bayes you use depends on the data. If your data consist of counts, the multinomial distribution may be an appropriate distribution for the likelihood, and thus multinomial Naive Bayes is appropriate.

Likewise, if your data points come from distribution $X$, use the likelihood for $X$ for Naive Bayes. Thus, it becomes $X$ Naive Bayes.
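As for why BernoulliNB appeared to work on {0, 1, 2} data: scikit-learn's BernoulliNB has a `binarize` parameter that defaults to 0.0, so any feature value greater than 0 is mapped to 1 before fitting. The model therefore never sees three distinct levels. A small sketch (toy data, not the phishing set) showing that fitting on raw values is equivalent to binarizing first:

```python
# BernoulliNB binarizes inputs at threshold 0.0 by default, so values in
# {0, 1, 2} are collapsed to {0, 1} before the Bernoulli likelihood is fit.
import numpy as np
from sklearn.preprocessing import binarize
from sklearn.naive_bayes import BernoulliNB

X = np.array([[0, 1, 2], [2, 0, 1]])
y = np.array([0, 1])

# The same preprocessing BernoulliNB applies internally: value > 0 -> 1.
X_bin = binarize(X, threshold=0.0)

# Fitting on the raw data (default binarize=0.0) and on pre-binarized
# data (binarize=None, i.e. "input is already binary") should agree.
pred_raw = BernoulliNB().fit(X, y).predict(X)
pred_bin = BernoulliNB(binarize=None).fit(X_bin, y).predict(X_bin)
```

So the {0, 1, 2} values were silently turned into presence/absence indicators, which may also explain why the accuracy changed relative to MultinomialNB.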