Solved – Counting frequencies in Multinomial Naive Bayes

multinomial-distribution, naive-bayes

I have noticed some ambiguity/inconsistency in how various authors calculate the p(word|label) estimate in Multinomial Naive Bayes. In some cases, this value is calculated by counting words without regard to which documents those words are in. Dan Jurafsky does so in his lecture notes Text Classification and Naïve Bayes and several related videos:

[Slide: Parameter estimation]

This widely-cited paper also calculates that value by counting words, without regard to document:

Tackling the Poor Assumptions of Naive Bayes Text Classifiers

Parameters must be estimated from the training data. We do this by selecting a Dirichlet prior and taking the expectation of the parameter with respect to the posterior. For details, we refer the reader to Section 2 of Heckerman (1995). This gives us a simple form for the estimate of the multinomial parameter, which involves the number of times word $i$ appears in the documents in class $c$ ($N_{ci}$), divided by the total number of word occurrences in class $c$ ($N_c$).
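For concreteness, here is a minimal sketch of that word-counting (multinomial) estimate with Laplace smoothing; the function name and the toy corpus are my own, not taken from the slides or the paper. $N_{ci}$ is the number of occurrences of word $i$ pooled over all class-$c$ documents, and $N_c$ is the total number of word tokens in class $c$.

```python
from collections import Counter

def multinomial_estimates(docs, labels, vocab, alpha=1.0):
    """p(word | class) by pooling word counts over all documents of a class."""
    estimates = {}
    for c in set(labels):
        counts = Counter()                      # N_ci: occurrences of word i in class c
        for doc, y in zip(docs, labels):
            if y == c:
                counts.update(doc)
        total = sum(counts.values())            # N_c: total word tokens in class c
        estimates[c] = {w: (counts[w] + alpha) / (total + alpha * len(vocab))
                        for w in vocab}
    return estimates

# Hypothetical toy corpus:
docs = [["buy", "viagra", "viagra"], ["meeting", "tomorrow"]]
labels = ["spam", "ham"]
vocab = sorted({w for d in docs for w in d})
print(multinomial_estimates(docs, labels, vocab)["spam"]["viagra"])
# (2 + 1) / (3 + 4) ≈ 0.43
```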

However, I have seen other sources that compute this value by counting documents, not words, such as these lecture notes from Andrew Ng:

CS229 Lecture notes

The parameters have a very natural interpretation. For instance, $\phi_{j|y=1}$ is just the fraction of the spam ($y = 1$) emails in which word $j$ does appear.
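By contrast, here is a sketch of the document-counting estimate described in Ng's notes (again my own toy code, not Ng's): each document contributes at most once per word, no matter how often the word occurs inside it.

```python
def bernoulli_estimates(docs, labels, vocab, alpha=1.0):
    """p(word | class) as the fraction of class-c documents that contain the word."""
    estimates = {}
    for c in set(labels):
        class_docs = [set(doc) for doc, y in zip(docs, labels) if y == c]
        n_docs = len(class_docs)
        estimates[c] = {w: (sum(w in d for d in class_docs) + alpha) / (n_docs + 2 * alpha)
                        for w in vocab}
    return estimates
```

With the same toy corpus as in the previous sketch, this gives p("viagra" | spam) = (1 + 1) / (1 + 2) ≈ 0.67, whereas the word-count estimate gave (2 + 1) / (3 + 4) ≈ 0.43.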

I have worked out various toy examples by hand, trying things both ways (as well as with/without Laplace smoothing, with the "naive" independence assumption holding/violated, and with all documents having the same/different number of words) and noted the following:

  • When all documents contain the same number of words, counting either way gives the same result.
  • When the "naive" assumption does hold true, counting either way gives the same result.
  • When documents contain different numbers of words and the "naive" assumption does not actually hold (i.e. the typical case), you get different numbers depending on which way you count; see the small numeric sketch after this list. Notably, you even get different numbers when calculating the values for "full" Bayes.
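To put a number on the third point, here is a tiny hand-worked example (the documents are hypothetical), without smoothing:

```python
# Two spam documents of unequal length (hypothetical toy data):
spam_docs = [["money", "money", "money", "offer"],   # 4 tokens
             ["hello"]]                              # 1 token

# Counting words: "money" accounts for 3 of the 5 spam tokens.
p_by_words = 3 / 5      # 0.6

# Counting documents: "money" appears in 1 of the 2 spam documents.
p_by_docs = 1 / 2       # 0.5

print(p_by_words, p_by_docs)   # the two conventions disagree
```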

I have not been able to find any explanation for these two different ways of calculating the p(word|label) value, and quite frankly (pace Jurafsky) counting by word seems like a mistake to me. However, since the difference only seems to matter when the "naive" independence assumption is violated, I wonder if perhaps this is somehow meant to correct for that? If so, I have yet to see any explanation/justification for this.

Can anybody shed some light on why there seem to be two fundamentally different ways of calculating this value?

Best Answer

When you count documents, you assume that if a word occurs at least once, that is enough for you to make a decision. For example, if you see "Viagra" in an e-mail, you can be relatively sure that it is spam. You would also count documents if you do not expect specific words to appear multiple times in a document (i.e. when repetition would not matter much).

Counting words is the opposite: if additional occurrences would make you more confident in a classification, then you count them. An example would be an e-mail that mentions "lots of money" multiple times; if you knew nothing else about it, the sheer number of occurrences of such a phrase might make you suspect some kind of scam.

In many machine learning applications people use both kinds of counts at once, via TF-IDF encoding: the term-frequency part is based on word counts within a document, while the inverse-document-frequency part is based on counting the documents in which a word appears.
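For instance, a minimal sketch assuming scikit-learn is available (the corpus is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; the tf part reflects within-document word counts,
# the idf part reflects how many documents contain each word.
corpus = ["lots of money lots of money now", "meeting at noon"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```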