Multi-class text classification with a negative class

classification, text mining

I have a multi-class short text classification task with a minor wrinkle: I'd also like to detect when a text doesn't fit any of the classes well. I've tried to do this by simply putting unrelated texts into a separate class and training an SVM, but with little success so far. This is unsurprising, since 1) there are very many ways in which a text may not fit my classes, and 2) if I use too many examples for the "unrelated" class, the algorithm will simply learn to (nearly) always return it (and resampling reduces this to the previous problem).

I.e. this is a multi-class version of the problem solved by One-class SVMs. Are there standard solutions?

EDIT: I've come up with a possible solution (but have not implemented or tested it yet).

Stage 1: a one-class classifier learned on the union of my classes (i.e. classify between relevant and irrelevant texts).

Stage 2: the usual multi-class classification if stage 1 says it's relevant.
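A minimal sketch of the two stages, assuming scikit-learn with TF-IDF features. The toy texts, the class names, the `nu` value and the `classify` helper are all illustrative assumptions, not something I have tuned or tested on real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC, OneClassSVM

# Toy training data, purely illustrative.
texts = [
    "parliament passed the budget bill",
    "the senate votes on the new tax law",
    "the president signed the treaty",
    "the striker scored two goals",
    "the team won the championship final",
    "the coach praised the goalkeeper",
]
labels = ["politics", "politics", "politics", "sports", "sports", "sports"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Stage 1: a one-class model fitted on the union of all known classes.
# nu upper-bounds the fraction of training points treated as outliers;
# 0.1 is an arbitrary starting value.
relevance = OneClassSVM(kernel="linear", nu=0.1).fit(X)

# Stage 2: the usual multi-class classifier on the same features.
multiclass = LinearSVC().fit(X, labels)


def classify(new_texts, reject_label="unrelated"):
    """Return a class label per text, or reject_label if stage 1 rejects it."""
    X_new = vectorizer.transform(new_texts)
    is_relevant = relevance.predict(X_new) == 1  # +1 = inlier, -1 = outlier
    predicted = multiclass.predict(X_new)
    return [
        str(label) if keep else reject_label
        for label, keep in zip(predicted, is_relevant)
    ]


print(classify(["the senate votes on the budget", "how to bake sourdough bread"]))
```

How often stage 1 rejects a text depends heavily on `nu`, the kernel and the features, so those would need tuning on held-out relevant and irrelevant texts.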

Best Answer

There is no trivial solution, simply because you want to somehow define the "everything else" when that everything else is unbounded (the feature space is unbounded), and unfortunately it cannot be described by a limited dataset. Even worse, what lies within a class is not well defined either: you just have samples that try to describe it.

You can try what you suggest: enclose each class with a one-class SVM, one per class, and check whether your unknown samples fall inside any of the resulting hyperspheres. The tricky part will be deciding how "hard" the boundary of each sphere should be. This is similar to the precision/recall tradeoff, except that in your case the negative class is not really defined. One (theoretical?) solution would be to create a dataset that defines the boundary of your classes. For example, someone could claim that everything is politics, so you need counter-examples of what is not politics for you. The closer your counter-examples are to the boundary, the better: not examples about cooking, but examples of government interference with some company, which sits on the boundary between politics and business. Of course, once you have that dataset you end up with a simpler binary classification problem.
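To make the one-model-per-class idea concrete, here is a rough sketch assuming scikit-learn. The toy data, the helper names `fit_class_models` and `classify`, and the values of `nu` and `gamma` are assumptions for illustration only. The `nu` parameter plays the role of the boundary "hardness": a larger value lets more training points fall outside the boundary, so more texts get rejected.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Toy training data per class, purely illustrative.
train = {
    "politics": [
        "parliament passed the budget bill",
        "the senate votes on the new tax law",
        "the president signed the treaty",
    ],
    "sports": [
        "the striker scored two goals",
        "the team won the championship final",
        "the coach praised the goalkeeper",
    ],
}

# One shared vocabulary so every per-class model sees the same feature space.
all_texts = [t for class_texts in train.values() for t in class_texts]
vectorizer = TfidfVectorizer().fit(all_texts)


def fit_class_models(nu, gamma=1.0):
    """Fit one one-class SVM per class; nu and gamma are arbitrary here."""
    return {
        name: OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(
            vectorizer.transform(class_texts)
        )
        for name, class_texts in train.items()
    }


def classify(text, models, reject_label="unrelated"):
    """Pick the class whose boundary contains the text, else reject_label."""
    x = vectorizer.transform([text])
    # decision_function is >= 0 inside the learned boundary, < 0 outside.
    # Comparing scores across separately fitted models is only a heuristic.
    scores = {
        name: float(model.decision_function(x).ravel()[0])
        for name, model in models.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= 0 else reject_label


models = fit_class_models(nu=0.5)
print(classify("the senate votes on the budget", models))
print(classify("how to bake sourdough bread", models))
```

Boundary counter-examples, if you manage to collect them, could then serve as a validation set for choosing `nu` and `gamma` per class.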
