According to Wikipedia, the conditional distribution in multinomial naive Bayes is:
$$p(\mathbf{x} \vert C=k) = \text{Multinomial}(n,\mathbf p_k) = \frac{(\sum_d x_d)!}{\prod_d x_d !} \prod_d {p_{kd}}^{x_d}$$
where $\mathbf x$ is the feature vector, $C$ is the class, and $d$ indexes the feature dimensions.
When used in the text domain:
the $i$th document's word-count feature is $\mathbf x_i=(x_1,\dots,x_d)$, with $d =|\text{Vocabulary}|$.
The document length plays the role of the multinomial parameter $n$, i.e. $n=\sum_d x_d$.
But every document has a different length! So $p(\mathbf x_i \mid c)$ is not Multinomial$(n,\mathbf p_c)$ but Multinomial$(n_i,\mathbf p_c)$; that is, the distribution changes with the sample $\mathbf x_i$.
The consequence is that $p(\mathbf x \mid C)$ is no longer a single multinomial distribution, and $\sum_{\mathbf x} p(\mathbf x \mid C)$ is not equal to 1.
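The normalization point can be checked numerically. A minimal sketch, assuming a hypothetical 3-word vocabulary and made-up class probabilities `p_c`: the multinomial pmf sums to 1 only over count vectors with a *fixed* total $n$, so each document length defines its own separately normalized distribution.

```python
# Sketch with a hypothetical 3-word vocabulary and made-up
# class-conditional word probabilities p_c.
from itertools import product
from math import factorial, prod

p_c = [0.5, 0.3, 0.2]

def multinomial_pmf(x, p):
    """pmf of Multinomial(sum(x), p) at the count vector x."""
    n = sum(x)
    coef = factorial(n) // prod(factorial(k) for k in x)
    return coef * prod(pi ** xi for pi, xi in zip(p, x))

def total_mass(n):
    """Sum the pmf over all count vectors whose entries sum to n."""
    return sum(
        multinomial_pmf(x, p_c)
        for x in product(range(n + 1), repeat=3)
        if sum(x) == n
    )

print(total_mass(4))  # ~1.0: Multinomial(4, p_c) is normalized
print(total_mass(7))  # ~1.0: a different, separately normalized distribution
# Summing over documents of *all* lengths would therefore exceed 1.
```

Each fixed length gives mass 1, so summing over all lengths cannot, which is exactly the complaint above.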
It is based on nothing more than the Multinoulli (categorical) distribution.
Am I missing something?
This wiki page has a good example of text classification.
EDIT: I have totally revised the post. Please comment on anything that is still unclear.
EDIT2: But people still use it regardless of the document length. Why?
Best Answer
I disagree; there is nothing invalid about the so-called multinomial naive Bayes model in this case.
I would, however, argue that the so-called multinomial naive Bayes model is not really a naive Bayes model in the strict sense, since it doesn't use the naive Bayes assumption that each feature is conditionally independent of every other feature given the class. Instead, it models the joint conditional distribution directly, not as a product of distributions for the individual features. The conditional independence assumption doesn't even hold for the multinomial distribution.
To see this, consider a document of length 10. If we know the frequency of term 1 is 10, the probability of term 2 having frequency greater than 0 must be 0. However, if we know term 1's frequency is 0, that probability can be non-zero. So even knowing the class, the features cannot be conditionally independent when the feature values are the per-term frequencies.
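The dependence argument can be verified by enumeration. A sketch, again assuming a hypothetical 3-word vocabulary and made-up probabilities: fix the document length at $n = 10$ and compute $P(x_2 > 0 \mid x_1)$ directly from the multinomial pmf.

```python
# Enumerate the multinomial support for a made-up 3-word vocabulary
# and check that the term counts are dependent given the class.
from itertools import product
from math import factorial, prod

p_c = [0.5, 0.3, 0.2]  # made-up class-conditional word probabilities
n = 10                 # document length, fixed by the multinomial

def pmf(x):
    coef = factorial(n) // prod(factorial(k) for k in x)
    return coef * prod(pi ** xi for pi, xi in zip(p_c, x))

# All count vectors over 3 words that sum to n.
support = [x for x in product(range(n + 1), repeat=3) if sum(x) == n]

def p_x2_positive_given_x1(v):
    """P(x_2 > 0 | x_1 = v) under Multinomial(n, p_c)."""
    num = sum(pmf(x) for x in support if x[0] == v and x[1] > 0)
    den = sum(pmf(x) for x in support if x[0] == v)
    return num / den

print(p_x2_positive_given_x1(10))  # 0.0: term 1 used all 10 slots
print(p_x2_positive_given_x1(0))   # > 0, so the counts are dependent
```

Conditioning on $x_1$ changes the distribution of $x_2$, so conditional independence given the class fails.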
However, in terms of the generative process itself, it can be viewed as following the naive Bayes assumption, if we treat each term occurrence in the document (and not just the term counts) as a feature. Then, no matter what the previous terms in the document were, we assume the next term is conditionally independent given the class, i.e., that it follows the same categorical distribution. So in that sense, you could say it does use the naive Bayes assumption.
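The token-level view can be sketched as follows (vocabulary and probabilities are again made up): drawing each token i.i.d. from a categorical distribution gives a sequence probability $\prod_d p_{kd}^{x_d}$, and the multinomial pmf for the counts is just that product times the number of orderings of the same bag of words.

```python
# Sketch of the generative view with a made-up categorical
# distribution over a tiny vocabulary.
from math import factorial, prod

p_c = {"apple": 0.5, "ball": 0.3, "cat": 0.2}  # made-up word probs

doc = ["apple", "ball", "apple", "cat"]  # one token sequence

# Naive Bayes at the token level: a product over i.i.d. draws.
seq_prob = prod(p_c[w] for w in doc)

# Multinomial pmf for the resulting count vector.
counts = {w: doc.count(w) for w in p_c}
coef = factorial(len(doc)) // prod(factorial(k) for k in counts.values())
multinomial_prob = coef * prod(p ** counts[w] for w, p in p_c.items())

# The two agree up to the ordering coefficient (and floating point).
print(abs(multinomial_prob - coef * seq_prob) < 1e-12)  # True
```

The multinomial coefficient only counts orderings of the same bag of words, so at the token level the model really is a product of identical categorical factors given the class.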