Solved – Classification of a huge number of classes

classification, oversampling

I have a dataset of samples belonging to >100 classes. I want to classify and/or cluster these classes. I have the following questions:

1) Is a single classifier efficient for such a problem, or should I train one classifier per class (or per subset of classes)? (From my point of view, the efficient solution is to find the features that discriminate each class from all the others and solve the problem as a one-vs-all classification problem. Any suggestions on that?)

2) About 60% of these classes have at most 1 or 2 samples. How can I create new samples for these classes? Do you think any of the SMOTE (synthetic minority oversampling technique) variants would work in this case?

Regards,

Best Answer

More than 100 classes shouldn't be a problem for most classification algorithms. However, if that number grows much larger, you should start looking at models designed for large-scale classification (large-scale here referring to the number of classes). You can probably find some hints in this (somewhat old) workshop on large-scale (hierarchical) text classification.
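As a rough illustration of a single model handling more than 100 classes directly, here is a minimal nearest-centroid sketch in plain Python. The toy data (120 classes laid out along a line), the function names `fit_centroids` and `predict`, and the choice of nearest centroid itself are assumptions made for the example; they are not taken from the question's dataset.

```python
import math
from collections import defaultdict


def fit_centroids(X, labels):
    """Return {label: centroid}, where each centroid is the per-feature mean."""
    groups = defaultdict(list)
    for x, label in zip(X, labels):
        groups[label].append(x)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(centroids, x):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


# Hypothetical toy data: 120 classes, two samples each,
# with class c centred at (c, 2 * c).
X, labels = [], []
for class_id in range(120):
    for offset in (-0.1, 0.1):
        X.append([class_id + offset, 2 * class_id - offset])
        labels.append(class_id)

centroids = fit_centroids(X, labels)
print(len(centroids), "classes; query assigned to class",
      predict(centroids, [57.2, 114.1]))
```

One model covers all classes here, and the cost of a prediction grows only linearly with the number of classes. Whether that stays accurate on real data depends on how well the features separate the classes.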

Regarding the number of examples per class, 1 or 2 is far too few. In my experience you need at least 10-20 examples per class, although this depends on several factors, such as the type of data and the collection.
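To make the sample-count issue concrete, the sketch below first lists the classes that fall under a minimum count, then shows a simplified SMOTE-style interpolation. It is not the full SMOTE algorithm, which picks among the k nearest same-class neighbours; this version pairs any two samples of the class. The names `rare_classes`, `interpolate` and `oversample_class`, and the threshold of 10, are assumptions for illustration.

```python
import random
from collections import Counter


def rare_classes(labels, min_count=10):
    """Return {label: count} for classes with fewer than min_count samples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}


def interpolate(a, b, rng):
    """SMOTE-style synthetic point a + t * (b - a), with t uniform in [0, 1)."""
    t = rng.random()
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]


def oversample_class(samples, n_new, seed=0):
    """Create n_new synthetic points by interpolating between pairs of samples.

    Simplification: pairs are drawn at random from the class, not from
    k-nearest neighbours as in real SMOTE.
    """
    if len(samples) < 2:
        # A single sample has no same-class neighbour to interpolate towards.
        raise ValueError("need at least 2 samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)
        synthetic.append(interpolate(a, b, rng))
    return synthetic


# Hypothetical labels: one well-populated class and two rare ones.
labels = ["a"] * 12 + ["b"] * 2 + ["c"]
print(rare_classes(labels))

# With only two samples, every synthetic point lies on the segment between them.
new_points = oversample_class([[0.0, 0.0], [2.0, 4.0]], n_new=5)
print(new_points)
```

For a 1-sample class the interpolation cannot run at all, and for a 2-sample class it only fills in the line segment between the two points, so it adds little new information. That is consistent with the advice to collect more labelled examples.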

To get new examples for some of the classes, have you considered some type of (semi-)manual labelling of documents to expand your training set?

I hope this helps.

Regards,