Solved – Alternatives to bag-of-words based classifiers for text classification

classification · machine learning · text mining

Most text classifiers are based on the bag-of-words approach, where you lose the context in which a particular word appears. As a (simple?) solution, we can use n-grams as features. But are there any classifiers that capture this idea and model it in some way before training?

Best Answer

I suggest two alternatives that have been used extensively in text classification:

  • Using Latent Semantic Indexing (LSI), which applies Singular Value Decomposition to the document-term matrix to identify latent "concept" components; in other words, it groups words into classes that represent concepts or semantic fields.
  • Using a lexical database such as WordNet or BabelNet to index documents by concepts rather than surface words, allowing semantic-level comparison of documents. This approach is not statistical, and it faces the problem of Word Sense Disambiguation.
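The first alternative can be sketched with scikit-learn, whose `TruncatedSVD` applied to a TF-IDF document-term matrix is exactly LSI (the documents and the choice of 2 components here are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "the cat chased the mouse",
    "the mouse ran from the cat",
]

# Build the TF-IDF document-term matrix, then project it onto a small
# number of latent "concept" components via truncated SVD -- this is LSI
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
X_lsi = svd.fit_transform(X)  # each row: one document in concept space
print(X_lsi.shape)
```

The reduced matrix `X_lsi` can then be fed to any standard classifier in place of the raw bag-of-words counts.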

Both methods can be applied before training. Note that neither of them aims to capture word order.