Do I understand you correctly that you want to measure whether C1 is a faster/slower learner than C2?
With unlimited training data, I'd definitely construct (measure) the learning curves. That allows you to discuss both questions you pose.
As Dikran already hints, the learning curve has a variance as well as a bias component: training on smaller data gives systematically worse models, but there is also higher variance between different models trained with smaller $n_{train}$, which I'd also include in a discussion of which classifier is better.
Make sure you test with large enough test sample size: proportions of counts (such as classifier accuracy) suffer from high variance which can mess up your conclusions. As you have an unlimited data source, you are in the very comfortable situation that it is actually possible to measure the learning curves without too much additional testing error on them.
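To make this concrete, here is a minimal sketch (Python/scikit-learn, not from the question) of measuring such a learning curve with repeated training draws per sample size and one large, fixed test set. The synthetic data and the logistic regression are placeholders for your actual data source and classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the "unlimited" data source: one large pool,
# a big fixed test set, and fresh training draws from the remainder.
X_all, y_all = make_classification(n_samples=300_000, n_informative=10,
                                   random_state=0)
X_test, y_test = X_all[:100_000], y_all[:100_000]
X_pool, y_pool = X_all[100_000:], y_all[100_000:]
rng = np.random.default_rng(0)

def draw_train(n):
    """Stand-in for drawing n fresh training samples from your data source."""
    idx = rng.choice(len(y_pool), size=n, replace=False)
    return X_pool[idx], y_pool[idx]

n_train_grid = [100, 300, 1_000, 3_000, 10_000]   # training sample sizes
n_repeats = 10                                    # repeats -> variance estimate

for n_train in n_train_grid:
    scores = []
    for _ in range(n_repeats):
        X_tr, y_tr = draw_train(n_train)          # fresh training set each time
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    # mean ~ bias component of the learning curve, sd ~ model-to-model variance
    print(n_train, np.mean(scores), np.std(scores, ddof=1))
```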
I just got a paper accepted that summarizes some thoughts and findings about Sample Size Planning for Classification Models. The DOI does not yet function, but anyway, here's the accepted manuscript at arXiv.
Of course, computation time is now a consideration. Some thoughts on this:
- How much computer time you are willing to spend will depend on what you need the comparison for.
- If it's just about finding a practically working set-up, I'd be pragmatic also about the time to get to a decision.
- If it's a scientific question, I'd quote my old supervisor: "Computer time is not a scientific argument". This is meant in the sense that saving a couple of days or even a few weeks of server time by compromising the conclusions you can draw is not a good idea*.
- The more so, as having better calculations doesn't necessarily require more of your time here: the time to set up the calculations will be roughly the same whether you calculate on a fine grid of training sample sizes or a rough one, or whether you measure variance by 1000 iterations or just by 10. This means that you can do the calculations in an order that gives you a "sneak preview" of the results quite fast, then sketch the results, and pull in the fine-grained numbers at the end (a small sketch of such an ordering follows after the footnote below).
(*) I may add that I come from an experimental field where you easily spend months or years on sample collection and weeks or months on measurements, which don't do themselves the way a simulation runs on a server, either.
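Purely as an illustration of that ordering idea (reusing the placeholder set-up from the first sketch): a coarse pass with few repeats gives the sneak preview, later passes refine the curve.

```python
# Illustrative ordering only: coarse grid with few repeats first, then refine.
coarse_pass = [(n, 2)  for n in (100, 1_000, 10_000)]             # (n_train, repeats)
fine_pass   = [(n, 10) for n in (100, 300, 1_000, 3_000, 10_000)]

for n_train, n_repeats in coarse_pass + fine_pass:
    ...  # run the train/test loop from the first sketch with these settings
```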
Update about bootstrapping / cross validation
It is certainly possible to use (iterated/repeated) cross validation or out-of-bootstrap testing to measure the learning curve. Using resampling schemes instead of a proper independent test set is sensible if you are in a small-sample-size situation, i.e. you do not have enough independent samples for training a good classifier and properly measuring its performance. According to the question, this is not the case here.
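For completeness, a minimal sketch of what such a resampling estimate could look like in scikit-learn (synthetic stand-in data, repeated stratified cross validation); this is only relevant in the small-sample situation just described.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # stand-in small data set

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")
# mean estimates performance at n_train ~ 0.8 * len(y); the spread mixes
# model (training) variance and the test variance from the small folds
print(scores.mean(), scores.std(ddof=1))
```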
Data-driven model optimization
One more general point: choosing a "working point" (i.e. training sample size here) from the learning curve is a data-driven decision. This means that you need to do another independent validation of the "final" model (trained with that sample size) with another independent test set. However, if your test data for measuring the learning curve was independent and had huge (really large) sample size, then your risk of overfitting to that test set is minute. I.e. if you find a drop in performance for the final test data, that indicates either too small a test sample size for determining the learning curve or a problem in your data analysis set-up (data not independent, training data leaking into test data).
Update 2: limited test sample size
is a real problem. Comparing many classifiers (each $n_{train}$ you evaluate ultimately leads to one classifier!) is a multiple-testing problem from a statistics point of view. That means that judging them all by the same test set "skims" the variance (uncertainty) of the testing, which leads to overfitting.
(This is just another way to express the danger of cherry-picking Dikran commented about)
You really need to reserve an independent test set for final evaluation, if you want to be able to state the accuracy of the finally chosen model.
While it is hard to overfit to a test set of millions of instances, it is much easier to overfit to 350 samples per class.
Therefore, the paper I linked above may be of more interest for you than I initially thought: it also shows how to calculate how many test samples you need to show e.g. superiority of one classifier (with fixed hyperparameters) over another. As you can test all models with the same test set, you may be lucky and able to somewhat reduce the required test sample size by doing paired tests here. For a paired comparison of 2 classifiers, McNemar's test would be a keyword.
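A hedged sketch of such a paired comparison with McNemar's test via `statsmodels` follows; the labels and predictions below are random stand-ins for your final test set and the two classifiers' outputs.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_test  = rng.integers(0, 2, size=700)   # stand-ins for the final test labels
pred_c1 = rng.integers(0, 2, size=700)   # ... and the two classifiers' predictions
pred_c2 = rng.integers(0, 2, size=700)

correct_c1 = pred_c1 == y_test
correct_c2 = pred_c2 == y_test

# 2x2 table of paired outcomes: rows = C1 correct/wrong, cols = C2 correct/wrong
table = np.array([
    [np.sum( correct_c1 &  correct_c2), np.sum( correct_c1 & ~correct_c2)],
    [np.sum(~correct_c1 &  correct_c2), np.sum(~correct_c1 & ~correct_c2)],
])
result = mcnemar(table, exact=True)      # exact binomial test on discordant pairs
print(result.statistic, result.pvalue)
```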
"Why exactly does a classifier need the same prevalence in the train and test sets?"
Perhaps my answer to a related question on the DS SE might help
Doesn't over(/under)sampling an imbalanced dataset cause issues?
Yes, the classifier will expect the relative class frequencies in
operation to be the same as those in the training set. This means
that if you over-sample the minority class in the training set, the
classifier is likely to over-predict that class in operational use.
To see why, it is best to consider probabilistic classifiers, where the
decision is based on the posterior probability of class membership
$p(C_i|x)$. Using Bayes' rule, this can be written as
$p(C_i|x) = \frac{p(x|C_i)p(C_i)}{p(x)}\qquad$ where $\qquad p(x) = \sum_j p(x|C_j)p(C_j)$,
so we can see that the decision depends on the prior probabilities of
the classes, $p(C_i)$. If the prior probabilities in the training
set are different from those in operation, the operational performance
of our classifier will be suboptimal, even if it is optimal for the
training-set conditions.
Some classifiers have a problem learning from imbalanced datasets, so
one solution is to oversample the classes to ameliorate this bias in
the classifier. There are two approaches. The first is to oversample
by just the right amount to overcome this (usually unknown) bias and
no more, but that is really difficult. The other approach is to
balance the training set and then post-process the output to
compensate for the difference in training set and operational priors.
We take the output of the classifier trained on an oversampled dataset
and multiply by the ratio of operational and training set prior
probabilities,
$q_o(C_i|x) \propto p_t(x|C_i)p_t(C_i) \times \frac{p_o(C_i)}{p_t(C_i)} = p_t(x|C_i)p_o(C_i)$
Quantities with the o subscript relate to operational conditions and
those with the t subscript relate to training-set conditions. I have
written this as $q_o(C_i|x)$ as it is an un-normalised probability,
but it is straightforward to renormalise them by dividing by the sum
of $q_o(C_i|x)$ over all classes. For some problems it may be better
to use cross-validation to choose the correction factor, rather than
the theoretical value used here, as it depends on the bias in the
classifier due to the imbalance.
So in short, for imbalanced datasets, use a probabilistic classifier
and oversample (or reweight) to get a balanced dataset, in order to
overcome the bias a classifier may have for imbalanced datasets. Then
post-process the output of the classifier so that it doesn't
over-predict the minority class in operation.
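To illustrate the post-processing step described in the quoted answer, here is a minimal sketch of the prior correction $q_o(C_i|x) \propto p_t(C_i|x)\,p_o(C_i)/p_t(C_i)$ with renormalisation; all numbers are made up.

```python
import numpy as np

def correct_priors(proba, train_priors, oper_priors):
    """Re-weight class posteriors obtained under the (over)balanced training
    conditions to the operational class frequencies, then renormalise rows."""
    q = proba * (np.asarray(oper_priors) / np.asarray(train_priors))
    return q / q.sum(axis=1, keepdims=True)

# e.g. balanced 50/50 training set, 99:1 operational prevalence (made-up numbers)
proba = np.array([[0.6, 0.4],
                  [0.3, 0.7]])
print(correct_priors(proba, train_priors=[0.5, 0.5], oper_priors=[0.99, 0.01]))
```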
Specific issues:
If we have a very imbalanced data set (say 99:1 for two classes), I
don't see why balancing the training set would introduce any problems.
It doesn't present a problem, provided you post-process the output of the model to compensate for the difference in training set and operational class frequencies. If you don't perform that adjustment (or you use a discrete yes-no classifier) you will over-predict the minority class for the reason given above.
"fans of “classifiers” sometimes subsample from observations in the
most frequent outcome category (here Y=1) to get an artificial 50/50
balance of Y=0 and Y=1 when developing their classifier. Fans of such
deficient notions of accuracy fail to realize that their classifier
will not apply to a population with a much different prevalence of Y=1
than 0.5."
I don't think this accurately represents the situation. The reason for balancing is actually that the minority class is "more important" in some sense than the majority class, and the rebalancing is an attempt to include misclassification costs so that the classifier works better in operational conditions. However, a lot of blogs don't explain that properly, so a lot of practitioners are rather misinformed about it.
Best Answer
This is more of an extended comment, as you have not given sufficient information to give detailed advice. Also, I have no experience with such a large-scale problem, and I suspect few really have. You say "I am designing a scikit learn classifier which has 5000+ categories and training data is at least 80 million and may grow up to an additional 100 million each year", which is a HUGE problem, and probably a major research project. You should take time to look at some papers describing similar efforts, like http://vision.stanford.edu/documents/DengBergLiFei-Fei_ECCV2010.pdf which describes trying to classify millions of images into 1000+ categories. I will cite a few paragraphs to show the immensity of the project:
"weeks, for a single run of one experiment, on a cluster of 66 machines"
Do you have the resources for such a project?
If not, and even then, you should start out with some simplified project, see how that goes, and continue from that.
One idea: with thousands of categories, there must be some hierarchical structure to the space of categories. If you can start mapping out that space, maybe organizing the categories in a binary tree, you could try a binary classifier for each level of the tree. Just a thought!
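To make that idea a bit more tangible, here is a rough sketch of such a tree of binary classifiers. The `split_categories` function is only a stub, since finding a good split of the 5000+ categories is exactly the hard, domain-specific part, and the tiny synthetic data set with 8 categories merely stands in for the real problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def split_categories(categories):
    """Hypothetical stand-in: split a list of categories into two groups.
    In practice this is where the domain-specific hierarchy would come in."""
    mid = len(categories) // 2
    return categories[:mid], categories[mid:]

class TreeNode:
    """One node of a binary tree over categories: a binary classifier
    routes each sample to the left or right subtree."""
    def fit(self, X, y):
        cats = sorted(set(y))
        if len(cats) == 1:                       # leaf: only one category left
            self.label, self.clf = cats[0], None
            return self
        left, _right = split_categories(cats)
        go_left = np.isin(y, left)
        self.clf = LogisticRegression(max_iter=1000).fit(X, go_left)
        self.left = TreeNode().fit(X[go_left], y[go_left])
        self.right = TreeNode().fit(X[~go_left], y[~go_left])
        return self

    def predict_one(self, x):
        node = self
        while node.clf is not None:              # descend until a leaf is reached
            node = node.left if node.clf.predict(x.reshape(1, -1))[0] else node.right
        return node.label

# tiny synthetic stand-in for the real problem (8 categories instead of 5000+)
X, y = make_classification(n_samples=400, n_classes=8, n_informative=6,
                           random_state=0)
tree = TreeNode().fit(X, y)
print(tree.predict_one(X[0]), y[0])
```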
Another idea: mapping out the space of categories with something like multidimensional scaling would give coordinates to the categories, and then you could build a predictor for those coordinates. Something like that could work, or not; we do not know until somebody tries! I guess this is really white spots on the map ...
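Equally speculative, a sketch of that coordinate idea: embed the categories with multidimensional scaling of some category-dissimilarity matrix (where that matrix comes from is the open question), regress the coordinates from the features, and predict the nearest category. Everything below is illustrative stand-in data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_categories = 50                                   # stand-in for 5000+

# stand-in category dissimilarities (in reality: derived from domain knowledge)
dissim = rng.random((n_categories, n_categories))
dissim = (dissim + dissim.T) / 2
np.fill_diagonal(dissim, 0)

coords = MDS(n_components=3, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)  # (n_categories, 3)

# stand-in training data: features X and integer category labels y
X = rng.random((1_000, 20))
y = rng.integers(0, n_categories, size=1_000)

# regress the category coordinates, then classify by the nearest category
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, coords[y])
pred = reg.predict(X[:5])                           # predicted coordinates, (5, 3)
dists = np.linalg.norm(pred[:, None, :] - coords[None, :, :], axis=-1)
print(dists.argmin(axis=1))                         # nearest-category predictions
```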
Good luck!