Bigram (N-gram) Model – Using Bigram Model to Build Feature Vector for Text Document

data mining, language-models, machine learning, natural language, text mining

A traditional approach to feature construction for text mining is the bag-of-words approach, which can be enhanced with tf-idf weighting to set up the feature vector characterizing a given text document. At present, I am trying to use a bigram (N-gram) language model to build the feature vector, but I do not quite know how to do that. Can I just follow the bag-of-words approach, i.e., compute frequency counts over bigrams instead of single words, and enhance them with the tf-idf weighting scheme?

Best Answer

Yes. That will generate many more features, though: it may be important to apply some cut-off (for instance, discard features such as bigrams or words that occur fewer than 5 times in your dataset) so as not to drown your classifier in too many noisy features.
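
For concreteness, here is a minimal sketch using scikit-learn's TfidfVectorizer (an assumption; the question does not name a library). Setting `ngram_range=(2, 2)` switches the counts from single words to bigrams, and `min_df` implements a frequency cut-off of the kind described above. The tiny corpus and the `min_df=2` threshold are illustrative only; on a real dataset you would raise the cut-off (e.g., to 5, as suggested).

```python
# Sketch: bigram feature vectors with tf-idf weighting and a rarity cut-off.
# The corpus and thresholds below are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day",
    "the quick brown fox is quick",
]

# ngram_range=(2, 2): count bigrams instead of single words.
# min_df=2: drop bigrams appearing in fewer than 2 documents; this is a
# document-frequency cut-off, a close analogue of the "occurs fewer than
# 5 times" rule mentioned in the answer, and keeps the feature space from
# being flooded with rare, noisy bigrams.
vectorizer = TfidfVectorizer(ngram_range=(2, 2), min_df=2)

X = vectorizer.fit_transform(corpus)        # tf-idf weighted bigram features
print(vectorizer.get_feature_names_out())   # the surviving bigram vocabulary
print(X.toarray())                          # one feature vector per document
```

The same vectorizer can be reused for unigrams plus bigrams by passing `ngram_range=(1, 2)`, which is a common compromise when pure bigram features turn out to be too sparse.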