A traditional approach to feature construction for text mining is the bag-of-words approach, which can be enhanced with tf-idf to set up the feature vector characterizing a given text document. At present I am trying to use a bigram (or, more generally, N-gram) language model to build the feature vector, but do not quite know how to do that. Can we just follow the bag-of-words approach, i.e., compute frequency counts over bigrams instead of words, and then enhance them with the tf-idf weighting scheme?
Bigram (N-gram) Model – Using a Bigram Model to Build a Feature Vector for a Text Document
Tags: data-mining, language-models, machine-learning, natural-language, text-mining
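The counting step the question proposes can be sketched in pure Python; the sample document and function name here are illustrative, not from the question:

```python
from collections import Counter

def bigram_counts(text):
    """Bag-of-bigrams: the same idea as bag-of-words, but the
    counted units are adjacent word pairs instead of single words."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

# Illustrative document
doc = "the quick brown fox jumps over the quick dog"
counts = bigram_counts(doc)
# counts[("the", "quick")] is 2; every other bigram occurs once
```

The resulting counter plays the same role as the word-count vector in plain bag-of-words, so a tf-idf weighting can be layered on top of it unchanged.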
Best Answer
Yes. Note that this will generate many more features, though, so it can be important to apply a cut-off (for instance, discard bi-grams or words that occur fewer than 5 times in your dataset) so as not to drown your classifier in too many noisy features.
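The combination of bigram counts, a frequency cut-off, and tf-idf weighting can be sketched as follows; the function names and the toy corpus are illustrative, and the cut-off threshold is a parameter (the answer suggests 5 for a real dataset):

```python
import math
from collections import Counter

def bigrams(text):
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def tfidf_bigram_vectors(docs, min_count=2):
    """tf-idf feature vectors over bigrams, keeping only bigrams
    that pass a corpus-wide frequency cut-off."""
    per_doc = [Counter(bigrams(d)) for d in docs]
    corpus = Counter()
    for c in per_doc:
        corpus.update(c)
    # Cut-off: drop bigrams seen fewer than min_count times overall
    vocab = sorted(b for b, n in corpus.items() if n >= min_count)
    n = len(docs)
    # Document frequency of each surviving bigram, for the idf term
    df = {b: sum(1 for c in per_doc if b in c) for b in vocab}
    vectors = [[c[b] * math.log(n / df[b]) for b in vocab] for c in per_doc]
    return vocab, vectors

docs = ["the cat sat", "the cat ran", "a dog ran"]
vocab, vectors = tfidf_bigram_vectors(docs, min_count=2)
# Only ("the", "cat") survives the cut-off; the third document
# does not contain it, so its weight there is 0.0
```

In practice the same pipeline is available off the shelf, e.g. scikit-learn's `TfidfVectorizer` with `ngram_range=(2, 2)` for bigrams and `min_df` for the cut-off.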