Solved – Support vector machine for text classification

kernel trick, libsvm, machine learning, svm, text mining

I currently have a data set with two classes: class 1 has about 8000 short text files and class 2 has about 3000. I applied LibSVM and tried several parameter combinations in a cross-validation experiment.

Generally, class 1 precision falls in the range of 85% to 90% and class 2 precision in the range of 70% to 75%; recall for both classes falls in the range of 80% to 85%.
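For reference, per-class precision and recall can be computed from true and predicted labels as in the minimal sketch below. The function name and the toy labels are made up for illustration; they are not the data set described above.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label that occurs."""
    true_pos = Counter()    # predicted as the label and truly the label
    predicted = Counter()   # how often each label was predicted
    actual = Counter()      # how often each label truly occurs
    for t, p in zip(y_true, y_pred):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            true_pos[t] += 1
    scores = {}
    for label in set(actual) | set(predicted):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores

# Toy labels, invented for this example only.
y_true = [1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 2, 2, 1]
scores = per_class_precision_recall(y_true, y_pred)
```

Precision divides by the number of documents predicted as a class, while recall divides by the number that truly belong to it, so the two can differ noticeably when the classes are of unequal size.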

For text classification, I built the feature space following the common approaches: tokenizing the documents, filtering stopwords, and building word vectors using tf-idf or binary weights, etc. I also tried an n-gram model to build the feature space, but these approaches did not improve performance much. Are there other ways to tune LibSVM that might improve performance? LibSVM provides a grid search for parameter selection, but it runs quite slowly.
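To make the feature-building steps concrete, here is a minimal sketch of tokenizing, stopword filtering, and tf-idf weighting in plain Python. The stopword list, the tokenizer regex, the unsmoothed idf formula log(N / df), and the sample documents are all assumptions chosen for illustration; real toolkits differ in these details.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw term count times unsmoothed inverse document frequency.
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

# Made-up documents, for illustration only.
docs = ["The cat sat in the hat", "The dog sat", "A cat and a dog"]
vectors = tfidf_vectors(docs)
```

A term that appears in only one document (such as "hat" here) receives a higher weight than one shared across documents, which is the property tf-idf is meant to provide.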

Best Answer

LibSVM hasn't been giving me reliable performance lately. Have you ever tried SVMLight?

You might also try looking at which features show the most predictive power in your model, and adding some sort of enriched feature. For example, if I were classifying documents on whether they contain information related to protein-protein interaction, I wouldn't really care about the specific names of the proteins as predictive features. I would pre-process my documents and normalize all protein mentions to a single common term that wouldn't normally occur in my documents, like "THISWORDUSEDTOBEAPROTEIN". Previous research (sorry, I can't think of any citations off the top of my head except my own paper, Ambert & Cohen, 2012) has shown that this can lead to improved classifier performance: it prevents the classifier from getting distracted by common genes (e.g., ADH1A) while ignoring rare ones, and instead combines the predictive power of all genes into a single feature you could think of as "GeneMentioned".
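A minimal sketch of that normalization step, assuming you already have a list of names to collapse. The lexicon below and the function name are hypothetical; in practice the names would come from a domain dictionary or a named-entity tagger.

```python
import re

PLACEHOLDER = "THISWORDUSEDTOBEAPROTEIN"

# Hypothetical lexicon, for illustration only.
PROTEIN_NAMES = ["ADH1A", "BRCA1", "TP53"]

# Try longer names first so a short name cannot pre-empt a longer one.
_alternatives = "|".join(
    re.escape(name) for name in sorted(PROTEIN_NAMES, key=len, reverse=True)
)
_pattern = re.compile(r"\b(?:" + _alternatives + r")\b", re.IGNORECASE)

def normalize_protein_mentions(text):
    """Replace every lexicon match with one shared placeholder token."""
    return _pattern.sub(PLACEHOLDER, text)

example = normalize_protein_mentions("ADH1A binds TP53, but not XYZ.")
# example == "THISWORDUSEDTOBEAPROTEIN binds THISWORDUSEDTOBEAPROTEIN, but not XYZ."
```

Run this before tokenization so that every mention ends up as the same token in the feature space, whichever weighting scheme is applied afterwards.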
