The approach of training the classifier on single-class "sentences" (really phrases rather than full sentences) is promising, but it breaks down because you are training the classifier on a different kind of data than you apply it to.
You need a pre-processing step that breaks sentences down into phrases, and you should use that same phrase-level data for both training and actual classification. Then, in your reporting step, you can aggregate the positive phrase-level results up to the input sentence: each sentence is a list of phrases, some of which are positively identified with classes, so you simply combine all of the phrase-level positive results.
This approach isn't perfect either: it will often fail to properly account for contextual information like negation and pronouns. But it's the next logical step if you're building it from the ground up.
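For concreteness, here is a minimal sketch of that pipeline. Everything in it is illustrative: the regex-based `split_into_phrases`, the toy phrase-level training data, and the TF-IDF + logistic-regression classifier are placeholder assumptions, not a prescription.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def split_into_phrases(sentence):
    # Naive splitter on commas, semicolons and coordinating conjunctions;
    # a real system would use a chunker or dependency parser.
    return [p.strip() for p in re.split(r",|;| and | but ", sentence) if p.strip()]

# Hypothetical phrase-level training data: one class label per phrase.
train_phrases = ["great battery life", "screen cracked easily", "fast shipping"]
train_labels = ["battery", "screen", "shipping"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_phrases, train_labels)

def classify_sentence(sentence):
    # Classify each phrase, then aggregate the positives to sentence level.
    phrases = split_into_phrases(sentence)
    return sorted(set(clf.predict(phrases)))

print(classify_sentence("fast shipping, but the screen cracked easily"))
```

The key point is that `clf` is both trained on and applied to phrase-level inputs; only the final reporting step works at the sentence level.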
This is more of an extended comment, as you have not given sufficient information to give detailed advice. Also, I have no experience with such a large-scale problem, and I suspect few really have. You say "I am designing a scikit learn classifier which has 5000+ categories and training data is at least 80 million and may grow upto an additional 100 million each year," which is a HUGE problem, and probably a major research project. You should take time to look at some papers describing similar efforts, like http://vision.stanford.edu/documents/DengBergLiFei-Fei_ECCV2010.pdf which describes trying to classify millions of images into 1000+ categories. I will quote a passage to show the immensity of the project:
In practice, all algorithms are parallelized on a computer cluster of
66 multicore machines, but it still takes weeks for a single run of
all our experiments. Our experience demonstrates that computational
issues need to be confronted at the outset of algorithm design when we
move toward large scale image classification, otherwise even a
baseline evaluation would be infeasible.
That is: weeks, for a single run of one experiment, on a cluster of 66 machines.
Do you have the resources for such a project?
If not, and even if you do, you should start out with some simplified project, see how that goes, and continue from there.
One idea: with thousands of categories, there must be some hierarchical structure to the space of categories. If you can start mapping out that space, maybe organizing the categories in a binary tree, you could try a binary classifier at each level of the tree. Just a thought!
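To make that binary-tree idea concrete, here is a rough sketch. The even split of the label list is an arbitrary placeholder (a real hierarchy would come from domain knowledge or from clustering the categories), and the SGD router, the recursion scheme, and the assumption that every branch receives training samples are all my own simplifications:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

class LabelTreeNode:
    """One node of a binary label tree: routes samples left or right."""
    def __init__(self, labels):
        self.labels = list(labels)
        self.clf = None
        self.left = self.right = None

    def fit(self, X, y):
        # X: 2-D numpy array of features, y: array of category labels.
        if len(self.labels) == 1:
            return self  # leaf: a single category remains
        y = np.asarray(y)
        mid = len(self.labels) // 2
        left_set = set(self.labels[:mid])
        side = np.array([0 if lab in left_set else 1 for lab in y])
        # Binary router for this node; SGD keeps training feasible at scale.
        self.clf = SGDClassifier().fit(X, side)
        self.left = LabelTreeNode(self.labels[:mid]).fit(X[side == 0], y[side == 0])
        self.right = LabelTreeNode(self.labels[mid:]).fit(X[side == 1], y[side == 1])
        return self

    def predict_one(self, x):
        if len(self.labels) == 1:
            return self.labels[0]
        go_left = self.clf.predict(x.reshape(1, -1))[0] == 0
        return (self.left if go_left else self.right).predict_one(x)

# Usage sketch: tree = LabelTreeNode(sorted(set(y))).fit(X, y)
```

With 5000+ categories this reduces each prediction to roughly a dozen binary decisions instead of one enormous multiclass problem.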
Another idea: mapping out the space of categories with something like multidimensional scaling would give coordinates to the categories, and then you could build a predictor for those coordinates. Something like that could work, or not; we will not know until somebody tries! I guess this is really white spots on the map ...
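As a toy illustration of that coordinate idea (the random dissimilarity matrix stands in for real category similarities, and the plain linear regressor as coordinate predictor is purely an assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_categories, n_dims = 50, 5

# Placeholder symmetric dissimilarity matrix between categories.
d = rng.random((n_categories, n_categories))
dissim = (d + d.T) / 2
np.fill_diagonal(dissim, 0.0)

# Embed the categories: each category gets a point in n_dims-space.
coords = MDS(n_components=n_dims, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)

# Train a regressor from features to the coordinates of the true category.
X = rng.random((200, 10))                   # placeholder features
y_cat = rng.integers(0, n_categories, 200)  # placeholder labels
reg = LinearRegression().fit(X, coords[y_cat])

# Predict by finding the nearest category in the embedded space.
pred_coords = reg.predict(X[:3])
dists = ((pred_coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
print(dists.argmin(axis=1))
```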
Good luck!
Best Answer
The scikit-learn implementation of the SVM binary classifier does not let you set a cutoff threshold as the other comments/replies have suggested. Instead of giving class probabilities, it straightaway applies a default cutoff to give you the class membership, e.g. 1 or 2. (You can, however, enable probability estimates with probability=True, or threshold the raw margins from decision_function yourself.)
To minimize false negatives, you could set higher weights for training samples labeled as the positive class; by default the weights are set to 1 for all classes. To change this, use the hyper-parameter `class_weight`. Ideally, you should avoid choosing a cutoff at all and simply provide the class probabilities to the end users, who can then decide which cutoff to apply when making decisions based on the classifier.
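A small sketch of both points on synthetic data: weighting the positive class more heavily (the 5x factor below is an arbitrary example value), and exposing scores or probabilities rather than a hard cutoff:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced data, purely for illustration.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Penalize errors on the positive class (label 1) five times as heavily.
clf = SVC(class_weight={0: 1, 1: 5}, probability=True).fit(X, y)

print(clf.predict(X[:5]))            # hard labels via the built-in cutoff
print(clf.decision_function(X[:5]))  # raw margins the user can threshold
print(clf.predict_proba(X[:5]))      # Platt-scaled class probabilities
```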
A better way to compare classifiers is a proper scoring rule, see https://en.wikipedia.org/wiki/Scoring_rule; note that the default `score()` method of sklearn.svm.SVC reports plain accuracy, which is not a proper scoring rule.
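For example, log loss and the Brier score are both proper scoring rules available in scikit-learn; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
clf = SVC(probability=True).fit(X, y)

# Proper scoring rules on predicted probabilities; evaluated in-sample
# here only for brevity (use held-out data in practice).
proba = clf.predict_proba(X)[:, 1]
print("log loss:", log_loss(y, proba))
print("Brier score:", brier_score_loss(y, proba))
```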