I'm doing text categorization in R with the SVM implementation in the e1071 package. I have around 30,000 text files for training and 10,000 for testing. The goal is to categorize these files hierarchically. For example, there are 13 categories in level 1, such as sports, literature, politics, etc., and more than 300 categories in level 2. For instance, under the sports category there are sub-categories like football, basketball, rugby, etc.
There are two strategies to reach the level-2 categorization. The first is to classify the files at the first level (13 categories) and then, recursively, classify each file among the subcategories of its predicted category. The second strategy is more direct: assign distinct labels to all of the 300+ level-2 categories and train a single SVM model on them.
For the second strategy, although I have applied SVD to the document-term matrix, reducing its dimensions to 30,000 × 10, the svm function in e1071 still breaks down with the error "cannot allocate vector of size 12.4 Gb".
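For reference, the second strategy can be sketched as below. This is a minimal, illustrative version: the synthetic data, shrunken dimensions, and variable names (`dtm`, `reduced`, etc.) are stand-ins for the real 30,000-document corpus, not the asker's actual code.

```r
# Minimal sketch of strategy 2: one flat SVM over all level-2 labels,
# trained on an SVD-reduced document-term matrix (as in the question).
library(e1071)

set.seed(1)
n_docs  <- 200   # stand-in for 30,000 training files
n_terms <- 50    # stand-in for the vocabulary size
k       <- 10    # number of SVD dimensions to keep

dtm    <- matrix(rpois(n_docs * n_terms, lambda = 1), nrow = n_docs)
labels <- factor(sample(paste0("cat", 1:5), n_docs, replace = TRUE))

# Truncated SVD: project each document onto the top k singular vectors
s       <- svd(dtm)
reduced <- s$u[, 1:k] %*% diag(s$d[1:k])   # n_docs x k representation

model <- svm(reduced, labels, kernel = "linear")
pred  <- predict(model, reduced)
mean(pred == labels)   # (training accuracy, optimistic by construction)
```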
So I'd like to ask you gurus: is a large number of categories a real problem for SVM? And specifically, in my case, which strategy will produce better results and is more feasible in practice?
Best Answer
The following answer is based on my own experience doing text analysis.
Of course, an increase in the number of categories will increase training time significantly, since you have larger matrix dimensions and so on. But it's not necessarily a bad approach. Moreover, the first strategy looks risky to me, because errors propagate down the hierarchy: if a file is assigned to the wrong level-1 category, it can never be routed to the correct subcategory, no matter how good the level-2 classifiers are (some subcategories can be very different from their siblings while the parent categories themselves are hard to separate). So I would probably go with the second strategy.
The second approach will need quite a lot of computational power. The error you're getting means your RAM is full (including swap, if you have any). There are a couple of basic suggestions concerning this problem.
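One common memory-saving measure (my suggestion, not something from the question) is to keep the document-term matrix sparse rather than dense: text matrices are mostly zeros, and e1071's `svm()` accepts sparse inputs such as `matrix.csr` objects from the SparseM package. A rough sketch, with small synthetic data:

```r
# Sketch: feed svm() a sparse document-term matrix instead of a dense one.
# A doc-term matrix is overwhelmingly zeros, so this can cut memory sharply.
library(e1071)
library(SparseM)

set.seed(1)
n_docs  <- 200
n_terms <- 1000
dense   <- matrix(rbinom(n_docs * n_terms, 1, 0.01), nrow = n_docs)
labels  <- factor(sample(c("sports", "politics"), n_docs, replace = TRUE))

sparse_dtm <- as.matrix.csr(dense)   # compressed sparse row format
model      <- svm(sparse_dtm, labels, kernel = "linear")
```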
These are the more common approaches for your second strategy. I would also look into the package RTextTools, which makes all of this work easier. Another option would be to use an approach other than SVM; I'm not sure, but I think there are already-implemented classification algorithms that exploit the fact that some of your categories have subcategories. And, as always, protect your progress against R crashes by saving the workspace to an .RData file and loading it back, and free memory with R's garbage collector, gc().
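To make the RTextTools suggestion concrete, here is a rough sketch of its pipeline (function names per the RTextTools documentation; the toy texts and labels are invented for illustration), with the workspace-saving and gc() advice appended at the end:

```r
# Sketch: the same classification task with RTextTools, which wraps
# matrix building, container creation, and model training.
library(RTextTools)

texts  <- c("football match tonight", "election results announced",
            "rugby world cup", "new parliament bill")
labels <- c("sports", "politics", "sports", "politics")

dtm       <- create_matrix(texts, language = "english",
                           removeNumbers = TRUE, stemWords = TRUE)
container <- create_container(dtm, labels,
                              trainSize = 1:3, testSize = 4, virgin = FALSE)
model     <- train_model(container, "SVM")
results   <- classify_model(container, model)

# Protect your progress and reclaim memory:
save.image("progress.RData")   # snapshot the whole workspace
# load("progress.RData")       # restore it after a crash
gc()                           # trigger R's garbage collector
```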