Solved – Simple text classifier: classification taking forever

Tags: classification, machine-learning, python, scikit-learn, text-mining

I work for a small tech startup, and I want to classify our users into demographics based on the domain of their email address. When users sign up to our site, they can enter a job category or pick "other". The goal is to classify as many of the "other" type as possible using a bag-of-words approach.

To do this, I have written some code in Python. For each user, I look at the domain name of their email address and scrape the text from their homepage (using Beautiful Soup). I also look for an "About Us" page, which I also scrape. What I'm left with is a map of domains to text. Some domains are classified (i.e., users whose email address comes from this domain have self-classified their job types), and some aren't (those users who have self-classified as "other"). The total data set for classified users is about 2000 (neglecting domains like gmail and hotmail and [I can't believe I'm about to type this] aol). I'm using a train/test split of 75/25.

Using scikit-learn, I'm trying to implement a simple classifier, but there seems to be an issue with either convergence or performance. The data set doesn't seem particularly big, but the two classifiers I've tried (Perceptron and RidgeClassifier) seem to be having some issues finding a fit. I haven't really tried to change the parameters for the classifiers, and it's not clear to me which knobs I should be turning.

I lack intuition into this problem, and it's difficult for me to tell whether the issues I'm having are due to insufficient data or something else entirely. I'd like to know:

  • Am I barking up the wrong tree? Has anyone tried something like this and made it work?
  • Do other ML packages for Python do a better job of text classification? (I'm looking at you, nltk.)
  • Is my data set large enough? Are there any "rules-of-thumb" for how much data you'd need for something like this (~5-10 categories)?
  • What's a reasonable amount of time for the learning to take? Are there any hints that will tell me the difference between "this is really hard" and "this isn't going to work"?
  • I've tried to follow the examples here and here. These examples are pretty speedy, so it makes me worry that I don't have enough data to make things work nicely. Is the "20 newsgroups" classification problem typical, or does it show up because it's easily solvable?

Any guidance here would be appreciated!


As an update: the huge performance hit seems to come from the "vectorizer": that is, the thing that maps a bag of words to a real-valued vector. For some reason, Tfidf was taking a long time to do its thing; I switched to a different vectorizer, and now things run quite quickly.
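For anyone hitting the same wall, here's a minimal sketch (with made-up documents) of the kind of swap that helps: `TfidfVectorizer` builds a vocabulary and computes idf statistics over the whole corpus, while `HashingVectorizer` is stateless and just hashes tokens into a fixed number of features, which is typically much faster on larger corpora.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

# Toy stand-in for the scraped homepage text (hypothetical data).
docs = [
    "we build analytics software for hospitals",
    "law firm serving small businesses",
    "cloud hosting and devops consulting",
]

# TfidfVectorizer: learns a vocabulary, then weights counts by inverse
# document frequency. Two passes over the data, plus vocabulary bookkeeping.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# HashingVectorizer: no vocabulary at all (feature hashing), so it's cheap
# and streamable, at the cost of not being able to map features back to words.
hasher = HashingVectorizer(n_features=2**16)
X_hash = hasher.transform(docs)

print(X_tfidf.shape)  # (3, vocabulary size)
print(X_hash.shape)   # (3, 65536)
```

The trade-off: hashed features can collide and aren't interpretable, but for a quick "is my pipeline slow or broken" check they're a handy baseline.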

As for the actual learning, I've found that the Naive Bayes routines work pretty well out of the box (f-score around 70-75%, which is good enough for now). The model that I found works the best, however, is one based on a linear SVM (scikits.svm.LinearSVC), which gets me somewhere in the 80-85% range with a bit of tinkering.
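For reference, the winning setup looks roughly like the sketch below (on toy stand-in data; in current scikit-learn the class lives at `sklearn.svm.LinearSVC`, and the 75/25 split matches what I described above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the domain -> homepage-text data.
texts = ["hospital patient care clinic", "patient clinic nurses doctors",
         "law contracts litigation firm", "litigation attorneys law court",
         "cloud servers hosting devops", "devops kubernetes cloud deploy"] * 10
labels = ["health", "health", "legal", "legal", "tech", "tech"] * 10

# 75/25 train/test split, as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# Vectorizer + linear SVM in one pipeline, so the vocabulary is fit
# only on the training portion.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```

The macro-averaged f-score is what I'm quoting above; on real scraped text the numbers are obviously lower than on a toy corpus like this.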

Best Answer

I think the issue here may be the use of term vectors. Your instances (bags of words) are translated into vectors of probably 150 to 10,000 dimensions. Each word that occurs in your corpus (the websites) is one dimension, and the value for each instance is the frequency (or the tf-idf score) of that word on the given website.

In a space with that many dimensions, most machine learning algorithms will suffer. You've chosen fairly lightweight algorithms, but they may still take a while to converge, depending on how they're implemented.

The most common classifier in this scenario is naive Bayes, which doesn't see the instance space as a high-dimensional space, but just as a collection of frequencies (from which it estimates, using Bayes' theorem, the probability of each class). Training this classifier should take as long as it takes to read the data once, and classification should take as long as it takes to read the instance. Since it shouldn't have any parameters, it will at least give you a good baseline. NLTK almost certainly has this algorithm (it's the mother of spam detection).
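To make the "single pass over the data" point concrete, here is a minimal sketch in scikit-learn (toy data; `MultinomialNB` is its multinomial naive Bayes on raw term counts):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled websites, one string per domain.
docs = ["hospital patient care", "patient clinic doctors",
        "law firm contracts", "litigation attorneys court",
        "cloud hosting servers", "devops kubernetes deploy"]
labels = ["health", "health", "legal", "legal", "tech", "tech"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # raw term counts, one column per word

# Fitting is just counting word frequencies per class: no iterative
# optimization, so it's as fast as reading the data once.
nb = MultinomialNB()
nb.fit(X, labels)

print(nb.predict(vec.transform(["patient care clinic"])))  # predicts 'health'
```

Because training reduces to counting, this is a cheap sanity check: if even naive Bayes performs poorly, the problem is likely the data, not the optimizer.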

Another option, if you want to use more traditional ML algorithms, is to reduce the dimensionality of the dataset to something manageable (anything below 50), using PCA. This will take more time, and make it more difficult to update your classifier, but it can lead to good performance.
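A rough sketch of that reduction step (toy data again): scikit-learn's `PCA` wants dense input, so for sparse term matrices the standard equivalent is `TruncatedSVD` (i.e., latent semantic analysis). I'm using a very small number of components here only because the toy vocabulary is tiny; on real data you'd pick something like the "below 50" suggested above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical scraped text, one string per domain.
docs = ["hospital patient care clinic", "law contracts litigation firm",
        "cloud servers hosting devops", "patient doctors nurses clinic",
        "attorneys court law litigation", "kubernetes deploy cloud devops"]

X = TfidfVectorizer().fit_transform(docs)  # sparse: one dimension per word

# Project the high-dimensional term space down to a few latent dimensions.
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)  # (6, 3)
```

The reduced matrix is dense and low-dimensional, so conventional classifiers become practical; the downside, as noted, is that adding new documents means refitting the decomposition.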
