Solved – Text Classification using TfIdf and Bernoulli NB

bernoulli-distribution, classification, naive-bayes, text-mining

As I am reading about the Bernoulli distribution and text classification, I want to understand how Bernoulli naive Bayes can use tf-idf features. Tf-idf values lie in [0, 1), but the multivariate Bernoulli model assumes the features are 0/1. So how does it work?

I also found this tutorial page on scikit-learn for text classification, in which the train and test features are extracted as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)

X_test = vectorizer.transform(data_test.data)
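To make the mismatch concrete, here is a quick check (on a toy corpus of my own, not the tutorial's dataset) showing that such a vectorizer produces fractional weights, not 0/1 values:

```python
# With sublinear tf and the default L2 row normalization, TfidfVectorizer
# emits fractional weights in (0, 1] rather than the binary 0/1 features
# a multivariate Bernoulli model assumes. The corpus is a toy assumption.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]
X = TfidfVectorizer(sublinear_tf=True).fit_transform(docs)

vals = X.data  # the nonzero tf-idf weights
print(vals.min(), vals.max())  # fractional, not 0/1
```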

and then a Bernoulli naive Bayes classifier is applied:

from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB(alpha=.01)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

Best Answer

Ultimately, the features in the Bernoulli model are indeed binary. However, you can control the threshold at which a given tf-idf value is converted to a 0 or a 1.

If you use scikit-learn, the relevant parameter of BernoulliNB is

binarize : float or None, optional
Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.

The default is binarize=0.0, so any nonzero tf-idf weight is mapped to 1; the model then only sees word presence/absence.
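A minimal sketch of how this plays out (the corpus, labels, and threshold are toy assumptions of mine):

```python
# BernoulliNB binarizes its input at the given threshold; with the
# default binarize=0.0, every nonzero tf-idf weight becomes 1.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["good movie", "bad movie", "good plot", "bad plot"]
y = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(docs)

# Fractional tf-idf input is fine: it is thresholded internally.
clf = BernoulliNB(alpha=0.01, binarize=0.0)
clf.fit(X, y)

# What the model effectively trains on: presence/absence indicators.
X_bin = (X.toarray() > 0.0).astype(int)
print(X_bin)
```

Raising the threshold (e.g. binarize=0.5) would instead treat only strongly weighted terms as "present".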

The more typical setup for Bernoulli models is to set binary=True in the CountVectorizer. If you are using tf-idf features, you will probably have more success with a multinomial model -- at least that is what I typically observe. Training naive Bayes models is cheap, so I usually compare both Bernoulli and multinomial.
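A sketch of that comparison, with a toy corpus and labels of my own for illustration:

```python
# Compare the two typical pairings: Bernoulli NB on binary
# presence/absence counts vs. multinomial NB on tf-idf weights.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

train_docs = ["great fun film", "dull boring film",
              "great plot", "boring plot"]
test_docs = ["great film", "boring film"]
y_train = [1, 0, 1, 0]

# Bernoulli model: binary=True gives 0/1 presence features directly.
count_vec = CountVectorizer(binary=True)
Xb_train = count_vec.fit_transform(train_docs)
Xb_test = count_vec.transform(test_docs)
bnb = BernoulliNB(alpha=0.01).fit(Xb_train, y_train)

# Multinomial model: accepts fractional tf-idf "counts" as-is.
tfidf_vec = TfidfVectorizer()
Xm_train = tfidf_vec.fit_transform(train_docs)
Xm_test = tfidf_vec.transform(test_docs)
mnb = MultinomialNB(alpha=0.01).fit(Xm_train, y_train)

print("Bernoulli:  ", bnb.predict(Xb_test))
print("Multinomial:", mnb.predict(Xm_test))
```

On real data you would score both on a held-out set and keep whichever wins.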