Solved – the best form (Gaussian, Multinomial) of Naive Bayes to use with categorical (one-hot encoded) features

classification, machine-learning, naive-bayes

I've been asked to use the Naive Bayes classifier to classify a couple of samples.

My dataset had categorical features, so I first had to encode them using a one-hot encoder, but then I was at a loss as to which statistical model to use (e.g. Gaussian NB, Multinomial NB).

I ended up using the multinomial version because I read somewhere that it works well in NLP and IR tasks, where documents are represented as term-count vectors or TF-IDF weights.

I would like to know if that was correct and, if possible, a quick explanation on why that is so.

PS There is this somewhat similar question, but I'm not sure whether that also applies to strictly binary (0 or 1) feature vectors.

Best Answer

As others have mentioned, there isn't a single "right" model. However, since you used one-hot encoding, you are now effectively dealing with boolean features: each term/feature follows a Bernoulli distribution. Given that, I would use either a multivariate Bernoulli NB or a multinomial NB with boolean features (which is what you already have). Gaussian NB seems a poor fit here, since you aren't dealing with real-valued features.
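To make the Bernoulli assumption concrete, here is a minimal from-scratch sketch (plain Python, Laplace smoothing; the function names and the toy data are my own for illustration, not from any particular library). The key point is that a 0 in a one-hot vector is informative: an absent feature contributes log(1 - p) to the score, which is exactly what distinguishes Bernoulli NB from multinomial NB.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit Bernoulli NB on binary feature vectors.
    Returns class log-priors and smoothed P(feature=1 | class)."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, feat_prob = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Laplace smoothing: add alpha to the "1" count, 2*alpha to the total
        feat_prob[c] = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return log_prior, feat_prob

def predict_bernoulli_nb(x, log_prior, feat_prob):
    """Pick the class maximizing log P(c) + sum_j log P(x_j | c).
    Features that are 0 contribute log(1 - p), not nothing."""
    best, best_score = None, float("-inf")
    for c, lp in log_prior.items():
        score = lp + sum(
            math.log(p) if xj else math.log(1 - p)
            for xj, p in zip(x, feat_prob[c])
        )
        if score > best_score:
            best, best_score = c, score
    return best

# Toy one-hot data: 3 binary features, 2 classes
X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = [0, 0, 1, 1]
log_prior, feat_prob = train_bernoulli_nb(X, y)
print(predict_bernoulli_nb([1, 0, 0], log_prior, feat_prob))  # → 0
```

In practice you would reach for a library implementation (e.g. scikit-learn's `BernoulliNB`), which follows the same math but also binarizes inputs and vectorizes the computation.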

This excellent paper has a lot of information on different NB variants and when to use which.