Solved – Naive Bayes: Mix unigrams and bigrams for text classification

machine-learning, naive-bayes

I'm creating a naive bayes text classifier, but I'm wondering if it's a good idea to break the text up into both unigrams and bigrams. Should I only use one method? Will having both variations mess with the algorithm?

Best Answer

Using both is a very common practice; it actually has a smoothing effect.

- Bigram --> low bias / high variance
- Unigram --> high bias / low variance

Combining the two helps to "hedge" the bets made by each. See, for instance, the first equation on pg. 13 of this link, where the author shows how to merge trigram, bigram, and unigram estimates.
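To make the idea concrete, here is a minimal sketch of a Naive Bayes classifier whose feature set is simply the union of unigrams and bigrams, with add-one (Laplace) smoothing. The helper names (`featurize`, `train`, `predict`) and the whitespace tokenization are my own illustrative assumptions, not something prescribed by the answer above:

```python
from collections import Counter
import math

def featurize(text):
    # Feature set = unigrams plus bigrams, concatenated into one bag.
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def train(docs):
    # docs: iterable of (text, label) pairs.
    counts = {}               # label -> Counter of feature frequencies
    class_totals = Counter()  # label -> number of training docs
    vocab = set()
    for text, label in docs:
        class_totals[label] += 1
        feats = featurize(text)
        counts.setdefault(label, Counter()).update(feats)
        vocab.update(feats)
    return counts, class_totals, vocab

def predict(text, counts, class_totals, vocab):
    n_docs = sum(class_totals.values())
    best_label, best_score = None, float("-inf")
    for label, feat_counts in counts.items():
        # log prior + sum of log likelihoods with add-one smoothing
        score = math.log(class_totals[label] / n_docs)
        total = sum(feat_counts.values())
        for feat in featurize(text):
            score += math.log((feat_counts[feat] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because unigram and bigram counts live in the same multinomial bag, a rare bigram that never appeared in training still contributes only a smoothed penalty, while its constituent unigrams can still pull the prediction in the right direction.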