Solved – Suitable number of classes for SVM in text categorization

Tags: classification, e1071, multi-class, r, svm

I'm doing text categorization with R, using the SVM implementation in the package e1071. I have around 30,000 text files for training and 10,000 for testing. The goal is to categorize these files hierarchically: there are 13 categories in level 1 (sports, literature, politics, etc.), and more than 300 categories in level 2. For instance, under the sports category there are sub-categories such as football, basketball, and rugby.

There are two strategies for reaching the level-2 categorization. The first is to classify the files at level 1 (13 categories) and then, within each predicted category, classify them again among its own sub-categories; a rough sketch of this is below. The second strategy is more direct: assign a distinct label to each of the 300+ level-2 categories and train a single SVM model on all of them.
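For concreteness, the first (hierarchical) strategy could be wired up with e1071 roughly as follows. The data frame `train`, its label columns `level1`/`level2`, and the feature columns are placeholders, not actual code from my project:

```r
library(e1071)

features <- setdiff(names(train), c("level1", "level2"))

## Step 1: one SVM for the 13 level-1 categories
m1 <- svm(x = train[, features], y = factor(train$level1))

## Step 2: one SVM per level-1 category, trained only on its sub-categories
m2 <- lapply(split(train, train$level1), function(d) {
  svm(x = d[, features], y = factor(d$level2))
})

## Prediction: route each test document through its predicted level-1 model
p1 <- predict(m1, test[, features])
p2 <- vapply(seq_len(nrow(test)), function(i) {
  as.character(predict(m2[[as.character(p1[i])]], test[i, features, drop = FALSE]))
}, character(1))
```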

For the second strategy, although I have applied SVD to the document-term matrix, reducing it to 30,000 × 10, the svm function in package e1071 still breaks down with the error "cannot allocate vector of size 12.4 Gb".
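Roughly, the second-strategy pipeline looks like this (object names are placeholders, the document-term matrix is assumed to be stored as a sparse Matrix, and irlba is just one way to do a truncated SVD); it is the final svm() call that fails:

```r
library(irlba)
library(e1071)

s    <- irlba(dtm, nv = 10)            # 10 latent dimensions, as described above
docs <- s$u %*% diag(s$d)              # 30,000 x 10 document representation

fit  <- svm(x = docs, y = factor(labels2))   # 300+ level-2 labels; this is where memory runs out
```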

So I'd like to ask you gurus: is the large number of categories a real problem for SVM? Specifically, in my case, which strategy will produce better results and is more feasible in practice?

Best Answer

The following answer is based on my own experience with text analysis.

Of course, increasing the number of categories will increase the training time significantly, since you end up with larger matrix dimensions and so on, but that does not make it a bad approach. The first strategy actually looks questionable to me, because an error in predicting the level-1 group propagates to the sub-group prediction: a document whose level-1 category is guessed wrong can never be assigned the correct sub-category, even though a sub-category can be clearly distinguishable on its own while its parent category, taken as a whole, is not. So I would probably go with the second strategy.

The second approach does require a fair amount of computational power. The error you're getting means that your RAM is full (and your swap too, if you have any). There are a couple of basic suggestions for this problem:

  1. Try to shrink your document-term matrices. That includes removing stopwords, punctuation, and other words that carry no meaning. This is a very common procedure, but sometimes it is worth building your own, more aggressive filter (see the sketch after this list).
  2. Don't train on the whole set of articles; use a sample instead. Sampling is one of the simplest ways to cut down the amount of computation (also shown below).
  3. The laziest solution: get a machine with more RAM, or increase your swap space, and let the computer do the rest.
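A hedged sketch of points 1 and 2 using the tm package. The character vector `texts`, the label vector `labels`, the 0.99 sparsity cut-off, and the 30% sample size are illustrative assumptions, not values from your data:

```r
library(tm)

corp <- VCorpus(VectorSource(texts))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

dtm <- DocumentTermMatrix(corp)
dtm <- removeSparseTerms(dtm, 0.99)   # drop terms absent from 99% of documents

## Point 2: train on a random subsample instead of all 30,000 documents
set.seed(1)
idx        <- sample(nrow(dtm), size = 0.3 * nrow(dtm))
dtm_sample <- dtm[idx, ]
lab_sample <- labels[idx]
```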

These are the more common approaches for your second case. I would also have a look at the package RTextTools, which makes much of this work easier (a sketch follows). Another option would be to use something other than SVM; I'm not sure, but I believe there are already-implemented classification algorithms that explicitly account for some of your categories having sub-categories, i.e. hierarchical classifiers.
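A rough sketch of the same pipeline with RTextTools (the package may need to be installed from the CRAN archive, and `texts`/`labels` are the same placeholders as above):

```r
library(RTextTools)

m         <- create_matrix(texts, language = "english",
                           removeStopwords = TRUE, removeNumbers = TRUE,
                           stemWords = TRUE, removeSparseTerms = 0.99)
container <- create_container(m, labels,
                              trainSize = 1:30000, testSize = 30001:40000,
                              virgin = FALSE)
svm_model <- train_model(container, "SVM")
results   <- classify_model(container, svm_model)
```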

And as always, protect your progress against R crashes by saving the workspace to an .RData file and loading it back afterwards. Also make use of R's garbage collector, gc().
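For example (the file name is arbitrary):

```r
save.image("progress.RData")   # snapshot the whole workspace to disk
## ...after a crash, restart R and run:
load("progress.RData")
gc()                           # ask R to release memory that is no longer referenced
```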
