Most text classifiers are based on the bag-of-words approach, where you lose the context in which a particular word appears. As a (simple?) solution we can use n-grams as features. But are there any classifiers that capture this idea and model it in some way before training?
Solved – Alternatives to bag-of-words based classifiers for text classification
classification, machine learning, text mining
Related Solutions
I think the most detailed answers can be found in Mehryar Mohri's extensive work on the topic. Here's a link to one of his lecture slide decks: https://web.archive.org/web/20151125061427/http://www.cims.nyu.edu/~mohri/amls/lecture_3.pdf
The problem of language detection is that human language (words) has structure. For example, in English it's very common for the letter 'u' to follow the letter 'q,' while this is not the case in transliterated Arabic. n-grams work by capturing this structure. Thus, certain combinations of letters are more likely in some languages than in others. This is the basis of n-gram classification.
Bag-of-words, on the other hand, depends on searching through a large dictionary and essentially doing template matching. There are two main drawbacks here: 1) each language would need an extensive dictionary of words on file, which would take a relatively long time to search through, and 2) bag-of-words will fail if none of the words seen in training appear in the text being classified.
Assuming that you are using bigrams (n=2) and there are 26 letters in your alphabet, then there are only 26^2 = 676 possible bigrams for that alphabet, many of which will never occur. Therefore, the "profile" (to use the language detector's term) for each language needs only a very small database. A bag-of-words classifier, on the other hand, would need a full dictionary for EACH language in order to guarantee that a language could be detected based on whichever sentence it was given.
In short, each language profile can be generated quickly with a relatively small feature space. Interestingly, n-grams only work because letters are not drawn i.i.d. in a language; this is explicitly leveraged.
Note: the general equation for the number of possible n-grams is l^n, where l is the number of letters in the alphabet.
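As a rough illustration (not the exact procedure any particular detector uses), here is a minimal sketch of a character-bigram profile classifier; the sample languages, training strings, and the cosine-similarity scoring are all illustrative assumptions.

```python
from collections import Counter

def bigram_profile(text):
    """Count character bigrams in a lowercased string (illustrative only)."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def similarity(profile_a, profile_b):
    """Cosine similarity between two bigram count profiles."""
    common = set(profile_a) & set(profile_b)
    dot = sum(profile_a[b] * profile_b[b] for b in common)
    norm_a = sum(v * v for v in profile_a.values()) ** 0.5
    norm_b = sum(v * v for v in profile_b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical training samples; a real detector would use far more text.
profiles = {
    "english": bigram_profile("the quick brown fox jumps over the lazy dog"),
    "spanish": bigram_profile("el veloz zorro marrón salta sobre el perro perezoso"),
}

query = bigram_profile("the dog sleeps")
print(max(profiles, key=lambda lang: similarity(profiles[lang], query)))
```

Even with these tiny profiles, the query string scores highest against the English profile, which is the whole point: a handful of bigram counts per language is enough, rather than a full dictionary per language.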
You're correct that the supervised learner can often be redundant with TF-IDF weighting. Here's the basic outline of why. In one typical form of TF-IDF weighting, the rescaling is logarithmic, so the weighting for a word $w$ in a document $d$ is $$ \text{TF-IDF}(w,d) = (\text{no. occurrences of $w$ in $d$}) \cdot f(w), $$ where $f(w)=\log\left(\frac{N}{\text{no. documents containing $w$}}\right)$ and $N$ is the number of documents in the corpus. When $f(w)>0$, TF-IDF just amounts to a rescaling of the term frequency.

So if we write the matrix counting the number of occurrences of each word in each document as $X$, then a linear model has the form $X\beta$. If we use TF-IDF instead of term frequency alone, the linear model can be written as $X(kI)\tilde{\beta}$, where $kI$ denotes the diagonal matrix with the weights $k_i=f(w_i)$ on its diagonal. The effect of $kI$ is to rescale each column of $X$. In this setting, the choice between TF-IDF and TF alone is inconsequential, because you'll get the same predictions: using the substitution $(kI)\tilde{\beta}=\beta$, we can see that the only effect is to rescale $\beta$.
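A quick numerical sketch of this equivalence, using NumPy least squares as a stand-in for the linear model; the count matrix and the IDF-style weights below are made up, and $X$ is assumed to have full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake term-count matrix X (documents x words) and response y.
X = rng.poisson(2.0, size=(20, 5)).astype(float)
y = rng.normal(size=20)

# Positive IDF-style weights, one per word (illustrative values).
k = np.array([0.3, 1.2, 0.7, 2.0, 0.5])
X_tfidf = X * k  # same as X @ np.diag(k): rescales each column of X

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_tilde, *_ = np.linalg.lstsq(X_tfidf, y, rcond=None)

# The coefficients differ only by the rescaling k, and the fitted
# predictions are identical (up to floating-point error).
print(np.allclose(beta, k * beta_tilde))             # True
print(np.allclose(X @ beta, X_tfidf @ beta_tilde))   # True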
But there are at least two scenarios where the choice to use TF-IDF is consequential for supervised learning.
The first case is when $f(w)=0$. This happens whenever a term occurs in every document, such as very common words like "and" or "the." In this case, TF-IDF will zero out the column in $X(kI)$, resulting in a matrix which is not full-rank. A rank-deficient matrix is often not preferred for supervised learning, so instead these words are simply dropped from $X$ because they add no information. In this way, TF-IDF provides automatic screening for the most common words.
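A tiny sketch of why a ubiquitous word gets zeroed out under the raw $\log(N/\text{df})$ weighting used above; the toy documents are made up, and note that many implementations (e.g. scikit-learn) smooth the IDF so it never reaches exactly zero.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog barked",
    "the bird sang a song",
]
N = len(docs)

def idf(word):
    # Raw (unsmoothed) inverse document frequency, as in the formula above.
    df = sum(word in doc.split() for doc in docs)
    return math.log(N / df)

print(idf("the"))   # 0.0 -> the column for "the" is zeroed out
print(idf("cat"))   # log(3/1) ~= 1.10
```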
The second case is when the matrix $X(kI)$ has its document vectors rescaled to the same norm. Since a longer document is very likely to have a much larger vocabulary than a shorter document, it can be hard to compare documents of different lengths. Rescaling each document vector also suppresses the importance of words appearing in longer documents, independently of how rare or common the word is in the corpus. Moreover, rescaling each document's vector to have the same norm after computing TF-IDF gives a design matrix which is not a linear transformation of $X$, so the original matrix cannot be recovered by a linear rescaling.
Rescaling the document vectors has a close connection to cosine similarity, since both methods involve comparing unit-length vectors.
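A minimal sketch of that connection, assuming L2 normalization of the document rows (the TF-IDF values below are made up): once each row has unit norm, the cosine similarity of two documents is just the dot product of their rows.

```python
import numpy as np

# Toy TF-IDF matrix: rows are documents, columns are terms (made-up values).
X = np.array([
    [1.0, 0.0, 2.0, 0.0],   # short document
    [4.0, 3.0, 8.0, 1.0],   # much longer document with a larger vocabulary
])

# Rescale each document vector to unit L2 norm.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Cosine similarity of the original rows equals the dot product of the unit rows.
cosine = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
print(np.isclose(cosine, X_unit[0] @ X_unit[1]))  # True
```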
The popularity of TF-IDF in some settings does not necessarily impose a limitation on the methods you use. Recently, it has become very common to use word and token vectors that are either pre-trained on a large corpus or trained by the researcher for their particular task. Depending on what you're doing, the scale of the data, and the goal of your analysis, it might be more expedient to use TF-IDF, word2vec, or another method to represent natural language information.
A number of resources on the topic are listed below, which I reproduce for convenience.
K. Sparck Jones. "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation, 28 (1). 1972.
G. Salton, E. A. Fox, and H. Wu. "Extended Boolean information retrieval". Communications of the ACM, 26 (11). 1983.
G. Salton and M. J. McGill. "Introduction to modern information retrieval". 1983.
G. Salton and C. Buckley. "Term-weighting approaches in automatic text retrieval". Information Processing & Management, 24 (5). 1988.
H. Wu and R. Luk and K. Wong and K. Kwok. "Interpreting TF-IDF term weights as making relevance decisions". ACM Transactions on Information Systems, 26 (3). 2008.
Best Answer
I suggest two alternatives that have been extensively used in text classification:
Both methods can be applied before training. Neither of them aims at capturing word order.