Solved – Vowpal Wabbit: best strategy for short text data like titles & keywords

classification · multi-class · vowpal-wabbit

I am using Vowpal Wabbit 7.10.0 (VW) to learn and predict categories on text data. However, my text data for each record is not an article or another decent-sized text document, but rather a couple of sentences: a title, a subtitle, and keywords.

I have around 10,000 labeled records I can use for training, validation, and testing, and around 1–2 million unlabeled records. It's a multi-class problem with around 100 class labels, and the classes are imbalanced.

What would be the best pre-processing and input format to get the most out of such data with VW?

My experience tells me that VW models should be sensitive to the class imbalance problem. Here is another source that confirms it. Is that right?

As for choosing a model, I decided that I would rather capture word combinations through n-grams than discover latent variables based on frequency counts (the texts are too short for that). Besides, some texts in my data repeat a word hundreds of times (for SEO). Hence I'm not going with TF-IDF. Is that right or not?
I guess I can combine both n-grams and bag-of-words as different namespaces. But which classifier, with which parameters, should I start with?

So far I have tried three different kinds of data pre-processing: (1) unprocessed text with only punctuation removed, (2) tokenization, lemmatization (not stemming), and stopword removal, (3) in addition to (2), bag of words, i.e. word:word_count format.
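To make option (3) concrete, here is a minimal sketch of turning one record into a VW example with two namespaces, `t` for raw title tokens and `k` for keyword counts (the namespace letters, stopword list, and helper name are my own illustration, not part of the question):

```python
import re
from collections import Counter

# Toy stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "for"}

def to_vw_line(label, title, keywords):
    """Build one VW example with two namespaces:
    |t raw title tokens, |k keyword bag-of-words in word:count format."""
    tokens = [w for w in re.findall(r"[a-z0-9]+", title.lower())
              if w not in STOPWORDS]
    counts = Counter(w for w in re.findall(r"[a-z0-9]+", keywords.lower())
                     if w not in STOPWORDS)
    t_ns = " ".join(tokens)
    k_ns = " ".join(f"{w}:{c}" for w, c in sorted(counts.items()))
    return f"{label} |t {t_ns} |k {k_ns}"

print(to_vw_line(3, "Cheap Flights to Paris", "flights flights paris deals"))
# -> 3 |t cheap flights to paris |k deals:1 flights:2 paris:1
```

Keeping the raw tokens and the counted bag of words in separate namespaces lets you later apply `--ngram` or interactions to one namespace without affecting the other.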

The results are not satisfactory with a very basic setting (this example used 16 classes, not 100):

  vw input.vw -c -k --passes 300 -b 24 --ect 16 -f model.vw
  vw input.vw -t -i model.vw -p preds.txt

The error rate is about 0.68, even on the training set.

I have limited time to explore all the settings in depth, and really need quick, informative advice: what is the best pre-processing technique in my case, and which model implemented in the latest VW should I use? These two issues are related.

Best Answer

Here are some tips for enhancing the performance of VW models:

  1. Shuffle the data prior to training. Having a non-random ordering of your dataset can really mess VW up.
  2. You're already using multiple passes, which is good. Try also decaying the learning rate between passes, with --decay_learning_rate=.95.
  3. Play around with the learning rate. I've had cases where --learning_rate=10 was great and other cases where --learning_rate=0.001 was great.
  4. Try --oaa 16 or --log_multi 16 rather than --ect 16. I usually find ect to be less accurate. However, oaa is pretty slow. I've found --log_multi to be a good compromise between speed and accuracy. On 10,000 training examples, --oaa 16 should be fine.
  5. Play with the loss function. --loss_function=hinge can sometimes yield large improvements in classification models.
  6. Play with the --l1 and --l2 parameters, which regularize your model. --l2 in particular is useful with text data. Try something like --l2=1e-6.
  7. For text data, try --ngram=2 and --skips=2 to add n-gram and skip grams to your models. This can help a lot.
  8. Try --autolink=2 or --autolink=3 to fit a quadratic or cubic spline model.
  9. Try ftrl optimization with --ftrl. This can be useful with text data or datasets with some extremely rare and some extremely common features.
  10. Try some learning reductions:
    1. Try a shallow neural network with --nn=1 or --nn=10.
    2. Try a radial kernel svm with --ksvm --kernel=rbf --bandwidth=1. (This can be very slow).
    3. Try a polynomial kernel svm with --ksvm --kernel=poly --degree=3. (This can be very slow).
    4. Try a gbm with --boosting=25. This can be a little slow.
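Tip 1 (shuffling) is easy to overlook because VW streams examples in file order. A minimal sketch of shuffling a VW input file before training (the function name and fixed seed are my own; this reads the whole file into memory, so it's for modest file sizes only):

```python
import random

def shuffle_vw_file(in_path, out_path, seed=42):
    """Randomly reorder the lines of a VW input file so that
    class labels are not clustered together in the stream.
    Loads the whole file into memory, so use on modest files only."""
    with open(in_path) as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(out_path, "w") as f:
        f.writelines(lines)
```

After shuffling, you might combine several of the tips above into one run, e.g. something like `vw shuffled.vw -c -k --passes 300 -b 24 --oaa 16 --ngram 2 --skips 2 --l2 1e-6 -f model.vw` — that exact combination is just one plausible starting point, not a recommendation from benchmarks.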

VW is extremely flexible, so it often takes a lot of fine tuning to get a good model on a given dataset. You can get a lot more tuning ideas here: https://github.com/JohnLangford/vowpal_wabbit/wiki/Command-line-arguments

Regarding the post you linked to: that person used vw with squared loss on an unbalanced classification problem. That's a silly thing to do, and pretty much guarantees that any linear model will always predict the dominant class. If you're worried about class balance, VW supports weights, so you can over-weight the rarer classes.

Edit: You have 100 classes and 10,000 training examples? That's an average of 100 observations per class, which isn't that many to learn from, no matter what model you use.
