Solved – Is it feasible to use k-Nearest Neighbours to identify text language

Tags: classification, distance, k nearest neighbour, machine learning, natural language

I have seen various language identification libraries that claim to use a naive Bayes classifier, such as CLD2 and language detector, but none that use other algorithms.

I wrote a simple program that uses 1-NN with the Euclidean distance measure to identify the language of a text. The features I chose are the occurrence counts of each word, or of character n-grams where n is anything from 1 to 4. I trained it on a few paragraphs of English and French text from some novels, and then put in some random (short) sections of text from the English and French Wikipedias respectively. It turned out that when using whole words, or n-grams with n above 2, as features, it detects the language correctly.
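For concreteness, here is a minimal sketch of the kind of setup I mean (simplified, with illustrative names; not my actual code):

```python
from collections import Counter
import math

def extract_ngrams(text, n):
    """Count character n-grams in a text (term frequency features)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def euclidean_distance(a, b):
    """Euclidean distance between two sparse count vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def classify_1nn(text, training_samples, n=3):
    """Label the text with the language of the nearest training sample."""
    features = extract_ngrams(text, n)
    return min(
        training_samples,
        key=lambda s: euclidean_distance(features, extract_ngrams(s[1], n)),
    )[0]

# training_samples is a list of (label, text) pairs, e.g.
# [("en", english_paragraph), ("fr", french_paragraph), ...]
```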

Update: When using the cosine similarity measure, even 1-gram and 2-gram features detected correctly. I also added both the whole passage (without splitting at line breaks) and individual paragraphs (strings split at line breaks) as training samples.

Also, when I used each paragraph from the novels as a training sample data point, the algorithm detected the languages of the unseen texts correctly under the conditions stated above (whole words, or n-grams with n greater than 2). However, when I used the whole section of multiple paragraphs as one training sample, the shorter French novel text (about half the length of the English text) seems to have caused both the English and the French Wikipedia short texts to be identified as French.

Update: So it seems that, with Euclidean distance, document length affects the results more than word occurrence does. With cosine similarity, on the other hand, using a longer passage that contains a more comprehensive set of vocabulary/letters/n-grams as the training sample improved the accuracy.
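One way to see this: cosine similarity compares only the direction of the count vectors, so it is invariant to document length (concatenating copies of a document roughly scales its count vector by a constant), whereas Euclidean distance grows with the raw counts:

$$\cos(\mathbf{u},\mathbf{v})=\frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert},\qquad \cos(c\,\mathbf{u},\mathbf{v})=\cos(\mathbf{u},\mathbf{v})\ \text{for any } c>0.$$

This would explain why the long single-sample French text dominated under Euclidean distance but not under cosine similarity.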

Update: When I added sections of Mein Kampf as training text, both the Euclidean distance metric and cosine similarity detected two short French and English texts as German when using 4-grams O.O. Also, the similarity values don't seem to differ by much.

So,

  • Why do most of these libraries use a naive Bayes classifier? Is it more suitable for this purpose?
  • Is it feasible to use k-NN for language identification? If so, is there any advantage to using it over a naive Bayes classifier? If not, why not?
  • What other types of features could I have used, other than term frequency?
  • Would binary features (a record of whether a term exists or not) suffice for this purpose?
  • What is the best similarity/dissimilarity metric to use for this particular purpose?
  • Would a k-d tree, or a k-d tree with the best-bin-first approach, help speed up detection when there is a large training sample base? (There are likely to be hundreds of thousands of dimensions/features :P)
  • If the cosine similarity metric is used, what data structure/algorithm could be used to reduce the time complexity of language detection? Can angles be precomputed, can LSH be used, or can a k-d tree be used? (A rough sketch of the LSH idea I have in mind follows this list.)
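For reference, here is a rough sketch of the random-hyperplane LSH scheme for cosine similarity that I have in mind (illustrative code, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hyperplanes(dim, n_bits):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional feature space."""
    return rng.standard_normal((n_bits, dim))

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(int(b) for b in (planes @ vec >= 0))

# Index: bucket each training vector by its signature. At query time, compute
# the query's signature and run exact cosine similarity only within the
# matching bucket. For any two vectors u, v, each bit agrees with probability
# 1 - angle(u, v) / pi, so vectors with a small angle (high cosine similarity)
# collide with high probability.
```

The point of the bucketing is that a query is compared against only the training vectors in its bucket rather than the whole training set, which is where the speed-up over brute-force 1-NN would come from.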

Please answer below if you know any of the sub-questions, or can provide an analysis of the phenomena mentioned above 🙂

I'm not sure whether I should put all the above questions in one post. If not, please tell me and I will move the unrelated ones into a separate question.

Best Answer

Perhaps one reason that NB is used is that it's a very simple method that achieves extremely high accuracy. I mean, extremely high. Detecting a language is not really an "open problem" any more. Perhaps, as someone mentioned, it may be worth trying to improve classification between very similar dialects or something.

Here is a report on research towards distinguishing Serbian from Croatian, Mandarin from Taiwan vs. mainland China, Indonesian vs. Malay, Portuguese from Brazil vs. Portugal, etc. State-of-the-art accuracy for most of these is around 99%.