Solved – Confused among Gaussian, Multinomial and Binomial Naive Bayes for Text Classification

machine learning, naive bayes, text mining

I am doing text classification, but I am confused about which Naive Bayes model I should use. From what I understood by reading answers in a couple of places: Gaussian Naive Bayes can be used when the attribute values are continuous, and when the attribute values are binary, binomial Naive Bayes can be used. For example, if we have words as features, we look at each sample to see whether a given word is present or not, and that is how we get a matrix of S (samples) * V (vocabulary of words) dimensions for text classification. If instead we had actual word counts for the S * V matrix, we would use multinomial Naive Bayes. My question is: if we use tf-idf (which has continuous/fractional values) for the S * V matrix, which Naive Bayes classification model should we use?

Am I getting a conceptually wrong idea of the data distribution?
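For concreteness, here is a minimal sketch of the three representations I mean, assuming scikit-learn (the toy documents and labels are just placeholders):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB

    docs = ["cheap pills cheap", "meeting agenda attached", "cheap meeting pills"]
    y = [1, 0, 1]

    # Binary presence/absence matrix -> Bernoulli ("binomial") Naive Bayes
    X_bin = CountVectorizer(binary=True).fit_transform(docs)
    BernoulliNB().fit(X_bin, y)

    # Raw word counts -> multinomial Naive Bayes
    X_count = CountVectorizer().fit_transform(docs)
    MultinomialNB().fit(X_count, y)

    # TF-IDF gives continuous, fractional values -- which Naive Bayes model fits here?
    X_tfidf = TfidfVectorizer().fit_transform(docs)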

Best Answer

Typically the multinomial naive Bayes model would still be used: you basically use the decimal TF-IDF value of each term in each document in place of the count for that term and proceed as you usually would (TF-IDF values are always $\ge 0$). This paper provides details of one way to do that and studies the results:

Rennie, J.; Shih, L.; Teevan, J.; Karger, D. (2003). Tackling the Poor Assumptions of Naive Bayes Text Classifiers. ICML. http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
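As a quick illustration (my own sketch, not the paper's exact method): scikit-learn's MultinomialNB accepts non-negative fractional features, so TF-IDF values can be fed in directly in place of counts:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Placeholder documents/labels; in practice use your own corpus.
    docs = ["cheap pills cheap", "meeting agenda attached", "cheap meeting pills"]
    y = [1, 0, 1]

    # TF-IDF values replace raw counts; MultinomialNB only requires features >= 0.
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(docs, y)
    print(clf.predict(["cheap pills now"]))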

Detailed explanation:

The idea is that instead of using the term frequencies divided by the total number of terms as the categorical probabilities, you compute the TF-IDF representation of each document and use the fraction of TF-IDF mass given to each term within a given class - i.e., sum up the values for a term across all documents in the class and divide by the total sum of values over all terms - to get the probability estimate for each term. So TF-IDF value totals are used in place of count totals: instead of adding up whole-number counts you are now adding up decimal values, but the procedure is the same.
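In symbols (my own notation, not the paper's): if $w_{dt}$ is the TF-IDF value of term $t$ in document $d$, the estimated probability of term $t$ in class $c$ is

$$\hat{\theta}_{ct} = \frac{\alpha + \sum_{d \in c} w_{dt}}{\alpha |V| + \sum_{t'} \sum_{d \in c} w_{dt'}},$$

where $|V|$ is the vocabulary size and $\alpha$ is the usual smoothing constant. With raw counts in place of $w_{dt}$, this reduces to the standard multinomial naive Bayes estimate.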

Then, as with traditional multinomial naive Bayes, you take the log of each term probability to obtain a log-linear decision function. In the traditional model you would then multiply each log value by the corresponding term frequency and sum across terms. Instead, this paper proposes a final normalization step first - normalizing the log values across terms - before applying the same linear decision rule.
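A rough sketch of that decision step, assuming theta holds the per-class term probabilities estimated from the TF-IDF totals as above (this simplifies the paper's Table 4, which also uses complement-class weights):

    import numpy as np

    def predict(theta, X):
        # theta: (n_classes, n_terms) per-class term probabilities
        # X: (n_docs, n_terms) TF-IDF feature matrix
        w = np.log(theta)                             # log of each term probability
        w = w / np.abs(w).sum(axis=1, keepdims=True)  # normalize log weights across terms
        scores = X @ w.T                              # linear decision function
        return scores.argmax(axis=1)                  # highest-scoring class per document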

Table 4 in the paper spells out this procedure clearly.