Solved – Why are n-grams used in text language identification instead of words

classification, machine learning, natural language, text mining

Two popular language identification libraries, Compact Language Detector 2 for C++ and language-detector for Java, both use character-based n-grams to extract text features. Why is a bag-of-words (single word/dictionary) approach not used instead, and what are the advantages and disadvantages of bag-of-words versus n-grams?

Also, what are some other uses of the n-gram model in text classification?

Oh oops. Seems like there is a similar question here: Regarding using bigram (N-gram) model to build feature vector for text document

But can someone give a more comprehensive answer? Which is better in the case of language identification?

(Hopefully I got the meaning of n-grams and bag-of-words correct, haha; if not, please help me with that.)

Best Answer

I think the most detailed answers can be found in Mehryar Mohri's extensive work on the topic. Here is a link to one of his lecture slide decks: https://web.archive.org/web/20151125061427/http://www.cims.nyu.edu/~mohri/amls/lecture_3.pdf

The key to language detection is that human languages have structure. For example, in English it is very common for the letter 'u' to follow the letter 'q', while this is not the case in transliterated Arabic. Character n-grams capture exactly this structure: certain combinations of letters are more likely in some languages than in others. This is the basis of n-gram classification.
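
To make that concrete, here is a minimal sketch of character n-gram extraction and counting. This is illustrative code only, not taken from either library; the function name `char_ngrams` and the sample sentence are mine:

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Count bigram frequencies; in English text, 'q' is almost always followed by 'u'.
counts = Counter(char_ngrams("The quick brown fox quietly queued for quinoa"))
print(counts.most_common(5))
```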

Bag-of-words, on the other hand, depends on searching through a large dictionary and essentially doing template matching. There are two main drawbacks: 1) each language would need an extensive dictionary of words on file, which takes a relatively long time to search, and 2) bag-of-words fails if none of the words seen in training appear in the text being classified.

Assuming that you are using bigrams (n = 2) and there are 26 letters in your alphabet, there are only 26^2 = 676 possible bigrams for that alphabet, many of which will never occur. Therefore, the "profile" (to use language-detector's term) for each language needs only a very small database. A bag-of-words classifier, on the other hand, would need a full dictionary for EACH language in order to guarantee that a language could be detected from whatever sentence it was given.
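
To sketch how such a profile could be used, here is a toy illustration. This is not language-detector's actual algorithm: the cosine-similarity scoring, the function names, and the two tiny training sentences are my own choices, and real profiles would be built from much larger corpora:

```python
from collections import Counter
import math

def profile(text, n=2):
    """Normalized character n-gram frequencies: the 'profile' of a text."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[g] * q.get(g, 0.0) for g in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Toy per-language profiles built from single sentences.
profiles = {
    "english": profile("the quick brown fox jumps over the lazy dog"),
    "spanish": profile("el veloz zorro marron salta sobre el perro perezoso"),
}

def detect(sentence):
    """Pick the language whose profile is most similar to the sentence's profile."""
    return max(profiles, key=lambda lang: similarity(profile(sentence), profiles[lang]))

print(detect("the dog is lazy"))       # expected: english
print(detect("el perro es perezoso"))  # expected: spanish
```

Note that the test sentences share almost no whole words with the training sentences, yet the bigram profiles still separate them, which is exactly the advantage over a word dictionary described above.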

So, in short: each language profile can be generated quickly with a relatively small feature space. Interestingly, n-grams only work because letters are not drawn i.i.d. in a language; that dependence is explicitly leveraged.

Note: in general, the number of possible character n-grams is l^n, where l is the number of letters in the alphabet.
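
For example, with the 26-letter English alphabet and trigrams (n = 3), there are at most 26^3 = 17,576 possible character trigrams, which is still far smaller than a typical word dictionary.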