Machine Learning – Should Stopwords Be Removed Before Generating N-Grams?

machine learning, natural language, nltk

I'm wondering whether stopwords are useful in n-grams or whether they should be removed before generating n-grams.

I would like to know the best practices for extracting features from text. I'm currently using nltk.

Best Answer

Should you generally remove stopwords?

It depends on what you use the n-grams for, but generally yes, I would recommend removing them. Otherwise many of the most frequent n-grams in your results are going to contain them.

When within your code should you remove the stopwords?

Depending on your intended use you can either:

  • calculate common n-grams and then remove those which contain stopwords OR
  • remove stopwords and then calculate common n-grams from the remaining text

Of those, the first approach is usually better, since it gives you n-grams that actually occur exactly as listed in the original corpus. Removing stopwords first can join words that were never adjacent in the text.
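To make the difference concrete, here is a minimal sketch of both orderings in plain Python. The tiny `STOPWORDS` set, the example sentence, and the helper names `ngrams`, `filter_after` and `filter_before` are all assumptions made for illustration; with nltk you would more likely use `nltk.util.ngrams` and `nltk.corpus.stopwords.words("english")` instead.

```python
# Tiny illustrative stopword set (an assumption); a real list would be much larger.
STOPWORDS = {"the", "of", "at", "is", "a", "in", "and", "to"}


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def filter_after(tokens, n):
    """Approach 1: build n-grams first, then drop any that contain a stopword."""
    return [g for g in ngrams(tokens, n) if not any(w in STOPWORDS for w in g)]


def filter_before(tokens, n):
    """Approach 2: drop stopwords first, then build n-grams from what is left."""
    kept = [w for w in tokens if w not in STOPWORDS]
    return ngrams(kept, n)


tokens = "the quarterly report is due at the end of the month".split()

after = filter_after(tokens, 2)
before = filter_before(tokens, 2)

print(after)   # only bigrams that really occur in the sentence
print(before)  # includes pairs such as ("report", "due") that were never adjacent
```

In this example the first approach keeps only `("quarterly", "report")`, while the second also produces pairs like `("due", "end")` that do not appear anywhere in the original sentence.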

A tweaked approach I like to use:

First calculate the n-grams, then remove only those n-grams that begin and/or end with a stopword. N-grams whose stopwords all sit in the middle are kept.

So the quadgram/4-gram "end of the month" would not be removed, since its stopwords ("of" and "the") are not in the first or last position.
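A rough sketch of this edge-only filter in plain Python could look like the following. The small `STOPWORDS` set, the example sentence, and the helper names `ngrams` and `drop_edge_stopword_ngrams` are hypothetical and chosen only for illustration; in practice the stopword list might come from `nltk.corpus.stopwords.words("english")`.

```python
# Tiny illustrative stopword set (an assumption); a real list would be much larger.
STOPWORDS = {"the", "of", "at", "is", "a", "in", "and", "to"}


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def drop_edge_stopword_ngrams(grams, stopwords):
    """Keep an n-gram only if neither its first nor its last word is a stopword.

    Stopwords in interior positions are allowed.
    """
    return [g for g in grams if g[0] not in stopwords and g[-1] not in stopwords]


tokens = "rent is due at the end of the month".split()
quadgrams = ngrams(tokens, 4)
kept = drop_edge_stopword_ngrams(quadgrams, STOPWORDS)

for gram in kept:
    print(" ".join(gram))
```

With this input, "end of the month" and "due at the end" survive, while 4-grams such as "the end of the" are dropped because they start or end with a stopword.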