I suggest two alternatives that have been used extensively in text classification:
- Using Latent Semantic Indexing (LSI), which applies Singular Value Decomposition to the document-term matrix in order to identify relevant (concept) components; in other words, it aims to group words into classes that represent concepts or semantic fields (see the sketch after this list).
- Using a lexical database like WordNet or BabelNet to index documents by concepts, allowing semantic-level comparison of documents. This approach is not statistical, and it has to deal with the Word Sense Disambiguation problem.
Both methods can be applied before training. Neither of them aims to capture word order.
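For the LSI route, here is a minimal sketch using scikit-learn's TruncatedSVD on a TF-IDF document-term matrix; the toy corpus and the choice of two components are my own illustrative assumptions, not part of the original suggestion.

```python
# Minimal LSI sketch: project a document-term matrix onto latent "concept" dimensions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]

vectorizer = TfidfVectorizer()            # builds the document-term matrix
svd = TruncatedSVD(n_components=2)        # k = 2 "concepts" for this toy corpus
lsi = make_pipeline(vectorizer, svd)

doc_concepts = lsi.fit_transform(docs)    # shape: (n_docs, 2)
print(doc_concepts)
```

Documents that share a semantic field (e.g. the two finance sentences) end up close together in the reduced concept space even when they share few exact words.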
I think the most detailed answers can be found in Mehryar Mohri's extensive work on the topic. Here's a link to one of his lecture slide decks on the topic: https://web.archive.org/web/20151125061427/http://www.cims.nyu.edu/~mohri/amls/lecture_3.pdf
The challenge in language detection is that human languages (their words) have structure. For example, in English it's very common for the letter 'u' to follow the letter 'q', while this is not the case in transliterated Arabic. n-grams work by capturing this structure: certain combinations of letters are more likely in some languages than in others. This is the basis of n-gram classification.
Bag-of-words, on the other hand, depends on searching through a large dictionary and essentially doing template matching. There are two main drawbacks here: 1) each language would have to have an extensive dictionary of words on file, which would take a relatively long time to search through, and 2) bag-of-words will fail if none of the words it saw during training appear in the text being classified.
Assuming that you are using bigrams (n=2) and there are 26 letters in your alphabet, there are only 26^2 = 676 possible bigrams for that alphabet, many of which will never occur. Therefore, the "profile" (to use the language detector's term) for each language needs only a very small database. A bag-of-words classifier, on the other hand, would need a full dictionary for EACH language in order to guarantee that a language could be detected from whichever sentence it was given.
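To make the "profile" idea concrete, here is a toy sketch of counting character bigrams; the sample strings and the crude overlap score are my own illustrative assumptions, not the exact scheme any particular language detector uses.

```python
# Toy character-bigram "profile": counts of adjacent letter pairs.
from collections import Counter

def bigram_profile(text):
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    return Counter(letters[i:i + 2] for i in range(len(letters) - 1))

english = bigram_profile("the quick brown fox jumps over the lazy dog")
query = bigram_profile("quick foxes jump quietly")

# Crude similarity: overlap of shared bigram counts. A real detector would
# normalize counts into probabilities and compare the query against one
# profile per language, picking the best match.
overlap = sum(min(english[b], query[b]) for b in query)
print(overlap)
```

Even this crude version stays tiny: each profile holds at most 676 entries, no dictionary of words required.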
So, in short: each language profile can be generated quickly from a relatively small feature space. Interestingly, n-grams only work because letters are not drawn i.i.d. in a language; this dependence is explicitly leveraged.
Note: the general formula for the number of possible character n-grams is l^n, where l is the number of letters in the alphabet.
Best Answer
It's very useful.
In text classification using bag-of-words, you routinely run into tasks where the number of features is much larger than the number of examples. That means that when you try to fit a linear model, you'll run into trouble, since the corresponding linear system is underdetermined.
L2 regularization is often used to deal with underdetermined linear systems. L1 is used for this too, and it has the additional advantage of enforcing sparsity, which makes the model simpler to interpret (if you want to read more about that, look up maximum a posteriori estimation or Bayesian linear models).
As an example, you can see this page from the scikit-learn documentation.
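As a rough sketch of the idea (the tiny corpus, labels, and C value are made-up assumptions, not taken from that page), this is how L2 and L1 penalties look with scikit-learn's LogisticRegression on bag-of-words counts:

```python
# L2- vs. L1-regularized logistic regression on bag-of-words features,
# in the typical "more features than examples" regime.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great movie, loved it", "terrible plot, boring",
         "wonderful acting", "awful and dull"]
labels = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(texts)   # far more features than examples

# C is the inverse regularization strength; smaller C means stronger regularization.
l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, labels)
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, labels)

# L1 drives many coefficients to exactly zero, giving a sparser, easier-to-interpret model.
print((l2_model.coef_ != 0).sum(), (l1_model.coef_ != 0).sum())
```

The L1 model typically keeps only a handful of non-zero word weights, which is what makes it easier to inspect which terms drive the predictions.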