I'm wondering whether stopwords are useful in n-grams, or whether they should be removed before generating the n-grams.
I would like to know best practices for extracting features from text. I'm currently using nltk.
machine learning, natural language, nltk
Best Answer
Should you generally remove stopwords?
It depends on what you use the n-grams for, but generally yes, I would recommend removing them. Otherwise, many of the most frequent n-grams in your results will contain them.
When within your code should you remove the stopwords?
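Depending on your intended use you can either:
1. calculate the n-grams first, then discard the n-grams that contain stopwords, or
2. remove the stopwords first, then calculate the n-grams from the remaining words.
Usually the first of those approaches is better, since it gives you n-grams which actually occur exactly as listed in the original corpus.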
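To make the difference between filtering after and filtering before n-gram generation concrete, here is a minimal sketch in plain Python. The tiny `STOPWORDS` set, the example sentence, and the `ngrams` helper are made up for illustration; with nltk you would more likely use `nltk.util.ngrams` and `nltk.corpus.stopwords.words("english")`.

```python
# Hypothetical, deliberately tiny stopword list (not nltk's real list).
STOPWORDS = {"at", "the", "of"}

def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "pay the monthly rent at the end of the month".split()

# Calculate n-grams first, then drop every n-gram containing a stopword.
filter_after = [
    gram for gram in ngrams(tokens, 2)
    if not any(word in STOPWORDS for word in gram)
]

# Remove stopwords first, then calculate n-grams from what is left.
content_tokens = [word for word in tokens if word not in STOPWORDS]
filter_before = ngrams(content_tokens, 2)

print(filter_after)   # only bigrams that appear verbatim in the text
print(filter_before)  # includes pairs that were never adjacent in the text
```

Removing the stopwords first produces bigrams such as `("end", "month")`, which never appear next to each other in the original sentence. Filtering afterwards can only return n-grams that really occur in the text.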
A tweaked approach I like to use:
First calculate the n-grams, then remove only those n-grams which begin and/or end with a stopword. N-grams that contain stopwords only in inner positions are kept.
So the quadgram/4-gram
end of the month
would not be removed since it does not contain stopwords in the first or last position.
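A rough sketch of that rule in plain Python might look like the following. The `STOPWORDS` set and the helper names `ngrams` and `keep_ngram` are my own placeholders, not part of nltk; in practice you could swap in `nltk.util.ngrams` and nltk's English stopword list.

```python
# Hypothetical, deliberately tiny stopword list (not nltk's real list).
STOPWORDS = {"at", "the", "of"}

def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def keep_ngram(gram, stopwords):
    """Keep an n-gram unless its first or last word is a stopword."""
    return gram[0] not in stopwords and gram[-1] not in stopwords

tokens = "at the end of the month".split()

# Calculate all 4-grams first, then filter on the edge positions only.
kept = [gram for gram in ngrams(tokens, 4) if keep_ngram(gram, STOPWORDS)]

print(kept)  # [('end', 'of', 'the', 'month')]
```

The 4-grams `at the end of` and `the end of the` are dropped because they start or end with a stopword, while `end of the month` survives even though it has stopwords in the middle.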