Solved – How to use TF-IDF vectors with Multinomial Naive Bayes

naive bayes, scikit-learn, tf-idf

Say we have used the TF-IDF transform to encode documents into continuous-valued features.

How would we now use this as input to a Naive Bayes classifier?

Bernoulli naive Bayes is out, because our features aren't binary anymore.
It seems we can't use multinomial naive Bayes either, because the values are continuous rather than discrete counts.

As an alternative, would it be appropriate to use Gaussian naive Bayes instead? Are TF-IDF vectors likely to hold up well under the Gaussian-distribution assumption?

The scikit-learn documentation for MultinomialNB suggests the following:

The multinomial Naive Bayes classifier is suitable for classification
with discrete features (e.g., word counts for text classification).
The multinomial distribution normally requires integer feature counts.
However, in practice, fractional counts such as tf-idf may also work.
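
To see what such fractional counts look like, here is a minimal sketch (the toy corpus below is made up for illustration) using scikit-learn's TfidfVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, just to inspect the transformed values
docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# The feature matrix contains continuous, non-integer values, not counts
print(X.toarray())
```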

Isn't it fundamentally impossible to use fractional values for MultinomialNB?
As I understand it, the likelihood function itself assumes that we are dealing with discrete counts:

(From Wikipedia):

$$p(\mathbf{x} \mid C_{k}) = \frac{\left(\sum_{i} x_{i}\right)!}{\prod_{i} x_{i}!} \prod_{i} p_{ki}^{x_{i}}$$

How would TF-IDF values even work with this formula, since the $x_i$ values are all required to be discrete counts?

Best Answer

This is possible without discretizing your counts or changing the form of your model to something with less natural assumptions (e.g. Gaussian).

The likelihood for a multinomial distribution can be expressed the way you've written it, but it can also be written differently to allow for nonnegative real counts.

$$p(\mathbf {x} \mid C_{k}) =\frac { \left( \sum _{i}x_{i} \right)! }{ \prod_{i}x_{i}! } \prod_{i}{p_{ki}}^{x_{i}} = \frac{ \Gamma\left(1 + \sum _{i}x_{i}\right) }{ \prod_{i} \Gamma\left(1 + x_{i}\right) } \prod_{i}{p_{ki}}^{x_{i}} $$

This comes from the identity $n! = \Gamma(n+1)$. The gamma function generalizes the factorial function to nonnegative reals. (It also extends to negative non-integer reals, but that's not relevant here.)
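
As a quick sanity check, here is a minimal sketch using only the Python standard library, confirming the identity on integers and evaluating the generalized factorial at a fractional value:

```python
from math import factorial, gamma, isclose

# gamma(n + 1) reproduces n! on the nonnegative integers
for n in range(6):
    assert isclose(gamma(n + 1), factorial(n))

# ...and is also defined for fractional "counts", e.g. a tf-idf weight of 2.5
print(gamma(2.5 + 1))  # the generalized "2.5!" (approx. 3.323)
```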

In this latter form, you can use non-integral counts, like tf-idf scores for words or pseudocounts from a fractional Dirichlet prior.
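
To make that concrete, here is a sketch of the generalized log-likelihood (computed in log space via scipy.special.gammaln for numerical stability); the fractional counts and class-conditional probabilities below are made-up illustration values:

```python
import numpy as np
from scipy.special import gammaln

def multinomial_log_likelihood(x, p):
    """log p(x | C_k) for nonnegative real-valued counts x, using the
    gamma-function form of the multinomial likelihood."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return (gammaln(1.0 + x.sum())
            - gammaln(1.0 + x).sum()
            + (x * np.log(p)).sum())

# Fractional "counts", e.g. tf-idf weights (hypothetical values)
x = np.array([0.3, 1.7, 2.0])
# Class-conditional word probabilities p_ki (hypothetical values)
p = np.array([0.2, 0.5, 0.3])

print(multinomial_log_likelihood(x, p))
```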

scikit-learn handles non-integral counts just fine, by the way.
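
For example, a minimal end-to-end sketch (the toy documents and labels are hypothetical) that feeds tf-idf features straight into MultinomialNB:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
docs = ["free money now", "meeting at noon", "win cash prizes", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

# MultinomialNB fits on the fractional tf-idf matrix without complaint
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)

print(clf.predict(["free cash prizes"]))
```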
