Minimum Training Set Size Required for Document Classification Based on Feature Count

classification, logistic, sample-size, text-mining

For document classification problems, is there a rule of thumb for the number of training instances required for the number of terms in the vocabulary?

I am using a logistic regression classifier with TF-IDF weighted features. After stop-word filtering, stemming, and filtering by minimum and maximum document frequency, I have a vocabulary of ~13,000 terms for a training set of ~20,000 documents. I tried LDA for dimensionality reduction by adding topic probabilities as features, but this did not significantly affect performance: a classifier trained only on LDA topic probabilities performed worse than classifiers trained on TF-IDF features alone or on TF-IDF plus LDA topic probabilities.
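For concreteness, here is a minimal sketch of this kind of setup in scikit-learn (an assumption on my part; the question doesn't name a toolkit). The toy corpus, labels, and document-frequency thresholds are placeholders, and stemming is omitted for brevity:

```python
# Minimal sketch of the described pipeline: stop-word removal,
# document-frequency filtering, TF-IDF weighting, logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "stock markets rallied on strong earnings reports",
    "the team won the championship game last night",
    "quarterly earnings beat analyst expectations",
    "the striker scored twice in the final match",
]
labels = ["finance", "sports", "finance", "sports"]

pipeline = make_pipeline(
    # min_df/max_df are illustrative; on a real corpus they prune the
    # vocabulary as described in the question. Stemming would be done
    # separately (e.g., with NLTK's SnowballStemmer) before vectorizing.
    TfidfVectorizer(stop_words="english", min_df=1, max_df=0.9),
    LogisticRegression(),
)
pipeline.fit(docs, labels)
print(pipeline.predict(["earnings rose sharply this quarter"]))
```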

Best Answer

If you want to use something like plain (i.e., unregularized) logistic regression, then the argument cited by Jose stands: you need the ratio #features/#examples to be small. Otherwise the training data are typically linearly separable, the maximum-likelihood weights diverge, and the model overfits badly.

However, you can do much better by using regularization. Popular forms include L1 (lasso) and L2 (ridge) penalties. If you regularize the model properly, you can get away with having many more features than training examples. In this paper, for example, using "dropout regularization" allowed us to train an effective logistic regression model with 25k training examples and 5 million features.
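To see the effect, here is a hedged sketch (not the paper's dropout method) comparing weak and strong L2 penalties on a synthetic problem with ten times more features than examples; the data and the C values are illustrative assumptions. Note that in scikit-learn, smaller C means a stronger penalty:

```python
# Compare regularization strengths on a "wide" problem
# (more features than examples) via cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 200 examples, 2,000 features: #features/#examples = 10.
X, y = make_classification(n_samples=200, n_features=2000,
                           n_informative=20, random_state=0)

for C in (1e4, 1.0, 1e-2):  # smaller C = stronger L2 penalty
    clf = LogisticRegression(C=C, max_iter=5000)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"C={C:g}  mean CV accuracy={score:.3f}")
```

For the lasso variant, `LogisticRegression(penalty="l1", solver="liblinear")` works the same way and additionally drives many coefficients exactly to zero.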

Finally, if you have even fewer training examples (say, a few hundred examples and a million features), you may need to go further and use a generative model such as naive Bayes. As explained here, you may then get away with a number of training examples that grows only logarithmically with the number of features.
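As a sketch of that alternative (again assuming scikit-learn, with a toy corpus), multinomial naive Bayes on raw term counts looks like this; the point is only that it remains usable with very few training examples:

```python
# Generative alternative: multinomial naive Bayes on bag-of-words counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "interest rates and inflation worry investors",
    "the goalkeeper made a stunning save",
    "central bank raises rates again",
    "coach praises the midfield after the win",
]
labels = ["finance", "sports", "finance", "sports"]

nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(docs, labels)
print(nb.predict(["bank cuts interest rates"]))
```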