Solved – Multinomial Naive Bayes is not Multinomial in text classification

classification, graphical-model, machine learning, naive bayes, natural language

According to Wikipedia, the Multinomial Naive Bayes conditional distribution is:
$$p(\mathbf{x} \vert C=k) = \text{Multinomial}(n,\mathbf p_k) = \frac{(\sum_d x_d)!}{\prod_d x_d !} \prod_d {p_{kd}}^{x_d}$$
where $\mathbf x$ is the feature vector and $C$ is the class; $d$ indexes the feature dimensions.
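For concreteness, here is a minimal sketch (with made-up $\mathbf p_k$ and counts $\mathbf x$, assuming NumPy/SciPy are available) that evaluates this PMF both directly from the formula and via scipy.stats.multinomial:

```python
import numpy as np
from math import factorial
from scipy.stats import multinomial

p_k = np.array([0.5, 0.3, 0.2])   # hypothetical class-conditional word probabilities p_k
x = np.array([3, 1, 1])           # hypothetical word-count vector x for one document
n = int(x.sum())                  # n = sum_d x_d, the document length

# PMF directly from the formula: (sum_d x_d)! / prod_d x_d! * prod_d p_kd^x_d
direct = factorial(n) / np.prod([factorial(int(c)) for c in x]) * np.prod(p_k ** x)

# The same value from SciPy's multinomial distribution
via_scipy = multinomial.pmf(x, n=n, p=p_k)

print(direct, via_scipy)          # both ≈ 0.15
```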

When used in the text domain:
given the $i$th document's word-count feature vector $\mathbf x_i=(w_1,\dots,w_d)$, $d =|\text{Vocabulary}|$.

The document length is the parameter $n$ of the multinomial, $n=\sum_d x_d$.
But every document has a different length! So $p(\mathbf x_i|c)$ is not Multinomial$(n,\mathbf p_c)$ but Multinomial$(n_i,\mathbf p_c)$ (that is, the distribution changes with the sample $\mathbf x_i$).

The consequence is that $p(\mathbf x|C)$ is no longer a multinomial distribution, and $\sum_{\mathbf x} p(\mathbf x|C)$ is not equal to 1.
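As a small numeric check of this point (vocabulary of size 2 and a made-up $\mathbf p_c$, purely for illustration): for any fixed length $n$ the multinomial PMF sums to 1 over all count vectors of that length, so summing over documents of all lengths cannot equal 1.

```python
import numpy as np
from scipy.stats import multinomial

p_c = np.array([0.7, 0.3])   # hypothetical class-conditional word probabilities

def total_mass(n):
    # sum of Multinomial(n, p_c) over all count vectors (x1, x2) with x1 + x2 = n
    return sum(multinomial.pmf([x1, n - x1], n=n, p=p_c) for x1 in range(n + 1))

print(total_mass(3), total_mass(7))    # each ≈ 1.0: normalized *within* a fixed length
print(total_mass(3) + total_mass(7))   # ≈ 2.0: mixing document lengths breaks normalization
```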

It is based on nothing more than the multinoulli (i.e., categorical) distribution.

Am I missing something?


This wiki has a good example of text classification.

EDIT: I have completely revised the post. Please comment on anything that is still unclear.

EDIT2: But people are still using it regardless of the document length. Why?

Best Answer

I disagree; there is nothing invalid about the so-called multinomial naive Bayes model in this case.

  • I think the confusion you are facing is that you are treating $n$ as a parameter of the model we are estimating, but it is not. We estimate only the parameters $p_d$ - i.e., we are actually estimating a categorical distribution. The multinomial arises because the generative process for a document is then a multinomial using the underlying categorical distribution. The document lengths are fixed (known), so the generative process is essentially $n$ draws from the categorical distribution. The value of $n$ changes for each document, so the $n$ parameter of the multinomial distribution changes, but from our categorical probability estimates we can still compute the probability of each document, regardless of $n$.
  • The parameters $p_d$ are estimated from the entire data set. I.e., we estimate $p_d$ across the training set to get the probability of drawing each term each time we draw a term when generating a document. The number of terms we draw for each document may differ ($n_i$, as you say), but we still assume the same categorical distribution for each draw in each document. Then, for each document, we use the observed term frequencies to get the probability of having seen that set of terms given the parameters $p_d$; the length of the document $n_i$ can vary, but that doesn't matter, since we can still use $p_d$ to calculate which class has the highest probability (a short sketch of this estimation and scoring follows this list).
  • I.e., there is no problem with the "naive bayes" model here.
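To make the two points above concrete, here is a minimal sketch (the toy counts, add-one smoothing, and helper names are mine, purely for illustration): the categorical parameters $p_d$ are estimated per class from the pooled training counts, and each new document is then scored with the multinomial log-likelihood using its own length $n_i$.

```python
import numpy as np
from scipy.special import gammaln

# toy training data: term-count vectors and class labels (made up for illustration)
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 2],
              [1, 3, 3]])
y = np.array([0, 0, 1, 1])

def estimate_p(X, y, alpha=1.0):
    """Estimate the categorical parameters p_d for each class with add-alpha smoothing."""
    classes = np.unique(y)
    counts = np.array([X[y == k].sum(axis=0) for k in classes]) + alpha
    return counts / counts.sum(axis=1, keepdims=True)

def multinomial_loglik(x, p):
    """log Multinomial(n_i, p) at count vector x, where n_i = x.sum() is the document length."""
    n = x.sum()
    return gammaln(n + 1) - gammaln(x + 1).sum() + (x * np.log(p)).sum()

P = estimate_p(X, y)   # one row of term probabilities per class; rows sum to 1

# Two new documents of *different* lengths: the same p_d are reused, only n_i changes.
for doc in [np.array([2, 0, 1]), np.array([1, 5, 4])]:
    scores = [multinomial_loglik(doc, P[k]) for k in range(len(P))]
    print("length:", doc.sum(), "predicted class:", int(np.argmax(scores)))
```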

I would, however, argue that the so-called multinomial naive Bayes model is not really a naive Bayes model in the strict sense:

  • It doesn't use the naive Bayes assumption that each feature is conditionally independent of every other feature given the class. Instead, it models the joint conditional distribution directly, not as a product of distributions over individual features. The conditional independence assumption doesn't even hold for the multinomial distribution.

  • To see this, consider a document with 10 terms. If we know the frequency of term 1 is 10, the probability of term 2 having a frequency greater than 0 must be 0. However, if we know the frequency of term 1 is 0, that probability could be non-zero. So, even knowing the class, we could not say the features are conditionally independent, where the feature values are the term frequencies for each term (the simulation sketch after this list illustrates this numerically).

  • However, in terms of the generative process itself, it can be viewed as following the naive Bayes assumption, if we treat each term in the document (and not just the term counts) as a feature. Then, no matter what the previous term or terms in the document were, we assume the value of the next term is conditionally independent given the class, i.e., that it follows the same categorical distribution. So in that sense, you could say it does use the naive Bayes assumption.
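To illustrate both bullets numerically, here is a small simulation sketch (the probabilities and document length are made up): the term counts of a single multinomial document are negatively correlated given the class, so they are not conditionally independent, while the individual tokens behave as i.i.d. categorical draws from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = np.array([0.5, 0.3, 0.2]), 10   # hypothetical class-conditional term probabilities, doc length

# counts view: sample many documents and look at the correlation between two term counts
counts = rng.multinomial(n, p, size=100_000)
print(np.corrcoef(counts[:, 0], counts[:, 1])[0, 1])   # clearly negative (≈ -0.65), not independent

# tokens view: the same documents can be generated as n independent categorical draws per document
tokens = rng.choice(len(p), size=(100_000, n), p=p)
counts_from_tokens = np.stack([(tokens == d).sum(axis=1) for d in range(len(p))], axis=1)
print(counts_from_tokens.mean(axis=0), counts.mean(axis=0))  # both ≈ n * p, the same generative process
```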