Solved – Is the Laplace/Lidstone smoothing parameter (for Multinomial/Bernoulli Naive Bayes) related to the particular structure of the dataset?

classification, laplace-smoothing, machine-learning, naive-bayes

I'm working with the Multinomial and Bernoulli Naive Bayes implementations in scikit-learn (Python) for text classification. I'm using the 20 Newsgroups dataset.
From the scikit documentation we have:

class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

and

class sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)

so we need to give a float value to alpha, the smoothing parameter (as the scikit-learn docs say: "setting alpha = 1 is called Laplace smoothing, while alpha < 1 is called Lidstone smoothing").
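For intuition, the Lidstone-smoothed estimate of a word's conditional probability in the multinomial model is (count + alpha) / (total + alpha * vocab_size). A minimal sketch in plain Python (the word counts below are invented for illustration):

```python
# Hypothetical word counts for one class: a 4-word vocabulary where
# "zettabyte" never occurs in this class's training documents.
counts = {"ball": 58, "game": 30, "team": 12, "zettabyte": 0}
total = sum(counts.values())   # 100
vocab_size = len(counts)       # 4

def smoothed_prob(word, alpha):
    """Lidstone-smoothed estimate of P(word | class)."""
    return (counts[word] + alpha) / (total + alpha * vocab_size)

# Unsmoothed (alpha=0), the unseen word gets probability 0, which would
# zero out the whole product of likelihoods for any document containing it:
print(smoothed_prob("zettabyte", alpha=0.0))   # 0.0
# Laplace smoothing (alpha=1) gives it a small nonzero probability:
print(smoothed_prob("zettabyte", alpha=1.0))   # 1/104 ≈ 0.0096
# Lidstone smoothing (alpha<1) gives an even smaller one:
print(smoothed_prob("zettabyte", alpha=0.01))  # ≈ 0.0001
```

The larger alpha is, the more the estimate is pulled away from the observed relative count and toward the uniform distribution over the vocabulary.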

Now, I noticed the Multinomial version works pretty well with Laplace smoothing (alpha=1.0), while the Bernoulli one performs quite poorly with that value.
I tried different alpha values for Bernoulli and found it only reached acceptable accuracy with small values like 0.01, 0.03, 0.001, etc.
So I thought Bernoulli Naive Bayes "prefers" Lidstone smoothing.
My question is: is it always like that (alpha=1.0 for Multinomial and alpha≈0.01 for Bernoulli), or is the best value of the smoothing parameter related to the particular structure of the dataset we're using?
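One plausible reason for the observation (hedged — this is a sketch of the mechanism, not a diagnosis of your exact runs): for a binary feature, the smoothed estimate of P(F=1 | class) has the form (count + alpha) / (n_class_docs + 2*alpha), and in text data most vocabulary words are absent from most documents of a class. With alpha=1 and modest class sizes, every one of those thousands of rare features gets a noticeably inflated presence probability:

```python
# Hypothetical: a rare word present in 0 of 100 training docs of a class.
n_docs = 100
count = 0

def bernoulli_smoothed(count, n_docs, alpha):
    """Laplace/Lidstone estimate for a binary feature (2 outcomes)."""
    return (count + alpha) / (n_docs + 2 * alpha)

print(bernoulli_smoothed(count, n_docs, alpha=1.0))   # 1/102 ≈ 0.0098
print(bernoulli_smoothed(count, n_docs, alpha=0.01))  # ≈ 0.0001
```

Unlike the multinomial model, the Bernoulli model also multiplies in a (1 - p) factor for every absent feature, so inflating p for thousands of absent words can shift the class scores substantially — which is consistent with small alpha working better there.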

Best Answer

It's not related to the "structure"; it's related to how certain you are that the relative count for a given case in your data is a correct estimate of its probability. (By "relative count" I mean the rate: the number of occurrences divided by the total number of examples in the dataset.)

Consider a dataset with a few features and a label that is either "positive" or "negative", with positives and negatives split 50/50. One of the features, call it F, is 0 for every "positive" example. What is the probability of getting a positive, given that F=1?

P("positive" | F=1)  =  P(F=1 | "positive") * P("positive") / P(F=1)

Given that there are no positives with F=1 in your dataset, you can set P(F=1 | "positive") to 0. Or you can argue that your dataset is finite, and that the true probability is greater than 0. If you believe the latter, you should set alpha > 0.
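To make this concrete, here is a numeric version of the example above with invented counts (an assumption for illustration: 100 positive docs, none with F=1, and 100 negative docs, 40 of which have F=1):

```python
# Invented counts for the 50/50 example: 100 positives (0 with F=1),
# 100 negatives (40 with F=1).
n_pos, pos_f1 = 100, 0
n_neg, neg_f1 = 100, 40

def smoothed(count, n, alpha):
    # Laplace/Lidstone estimate for a binary feature (2 outcomes).
    return (count + alpha) / (n + 2 * alpha)

def posterior_positive(alpha):
    """P("positive" | F=1) via Bayes' rule, using smoothed likelihoods."""
    p_f1_pos = smoothed(pos_f1, n_pos, alpha)
    p_f1_neg = smoothed(neg_f1, n_neg, alpha)
    p_pos = p_neg = 0.5                              # 50/50 class split
    p_f1 = p_f1_pos * p_pos + p_f1_neg * p_neg       # total probability of F=1
    return p_f1_pos * p_pos / p_f1

print(posterior_positive(alpha=0.0))   # 0.0   — unseen case stays impossible
print(posterior_positive(alpha=1.0))   # 1/42 ≈ 0.024 — small but nonzero
print(posterior_positive(alpha=0.01))  # ≈ 0.00025 — nearly 0, trusting the counts
```

The choice of alpha encodes exactly the belief discussed above: alpha=0 says the observed zero count is the true probability; larger alpha says the finite sample may simply have missed some positives with F=1.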