Solved – At what n do n-grams become counterproductive?

natural language, text mining

When doing natural language processing, one can take a corpus and estimate the probability of the next word in a sequence from the n-1 words that precede it. n is usually chosen as 2 or 3 (bigrams and trigrams).

Is there a known point at which tracking the data for the nth chain becomes counterproductive, given the amount of time it takes to classify a particular corpus once at that level? Or given the time it would take to look up the probabilities from a dictionary data structure?
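For concreteness, here is a minimal sketch of the setup described above, assuming plain Python: n-gram counts stored in a dictionary keyed by the (n-1)-word context, with next-word probabilities read off by a single lookup (the function names and toy corpus are illustrative):

```python
from collections import defaultdict

def build_ngram_counts(tokens, n):
    """Map each (n-1)-word context to a counter of the words that follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_word_prob(counts, context, word):
    """Maximum-likelihood estimate of P(word | context) from the count dictionary."""
    continuations = counts.get(tuple(context))
    if not continuations:
        return 0.0
    return continuations[word] / sum(continuations.values())

tokens = "the cat sat on the mat and the cat slept".split()
trigrams = build_ngram_counts(tokens, 3)
print(next_word_prob(trigrams, ("the", "cat"), "sat"))  # 0.5
```

Each probability lookup is one hash of the context tuple plus a sum over its observed continuations, so lookup cost grows with the number of distinct contexts (which explodes as n grows) rather than with n itself.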

Best Answer

Is there a known point at which tracking the data for the nth chain becomes counterproductive, given the amount of time it takes to classify a particular corpus once at that level?

You should be looking for perplexity vs. n-gram size tables or plots.

Examples:

http://www.itl.nist.gov/iad/mig/publications/proceedings/darpa97/html/seymore1/image2.gif (perplexity vs. n-gram size plot)

http://images.myshared.ru/17/1041315/slide_16.jpg (perplexity vs. n-gram size slide)

http://images.slideplayer.com/13/4173894/slides/slide_45.jpg (perplexity vs. n-gram size slide)

The perplexity depends on your language model, n-gram size, and data set. As usual, there is a trade-off between the quality of the language model and how long it takes to run. The best language models nowadays are based on neural networks, so the choice of n-gram size is less of an issue (but then you need to choose the filter size(s) if you use a CNN, amongst other hyperparameters…).
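If you would rather generate such a curve for your own corpus than rely on published plots, here is a rough sketch, assuming plain Python, add-one (Laplace) smoothing as a stand-in for whatever smoothing your model uses, and hypothetical train.txt/test.txt token files; the n at which held-out perplexity stops falling is the point past which a larger n buys you nothing on your data:

```python
import math
from collections import defaultdict

def ngrams(tokens, n):
    """All contiguous n-grams of the token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def perplexity(train, test, n, alpha=1.0):
    """Add-alpha-smoothed n-gram perplexity on held-out text (illustrative)."""
    vocab = set(train) | set(test)
    gram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for g in ngrams(train, n):
        gram_counts[g] += 1
        context_counts[g[:-1]] += 1
    log_prob, count = 0.0, 0
    for g in ngrams(test, n):
        p = (gram_counts[g] + alpha) / (context_counts[g[:-1]] + alpha * len(vocab))
        log_prob += math.log(p)
        count += 1
    return math.exp(-log_prob / count)

train = open("train.txt").read().split()  # hypothetical tokenized training file
test = open("test.txt").read().split()    # hypothetical held-out file
for n in range(1, 6):
    print(n, perplexity(train, test, n))  # expect a minimum, then a rise
```

On a small corpus the printed perplexities typically drop from n=1 to n=2 or 3 and then climb again as the higher-order counts become too sparse, which is exactly the shape of the plots linked above.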
