Solved: improve precision in text classification

classification · feature selection · machine learning · precision-recall · text mining

I am working on binary text classification using sklearn:

  1. Each sample is fairly short (~200-500 characters)
  2. I use TF-IDF to extract important words:
    TfidfVectorizer(sublinear_tf=False, max_df=0.5, stop_words='english', max_features=5000)
  3. SGDClassifier is used as follows (a runnable sketch of the whole setup appears after this list):
    SGDClassifier(loss='hinge', alpha=0.0001, n_iter=50, penalty='l2', shuffle=False, class_weight='auto')
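
For reference, here is a minimal runnable sketch of this setup with placeholder data (texts/labels are hypothetical). Note that in current scikit-learn the n_iter parameter is called max_iter and class_weight='auto' has been replaced by class_weight='balanced':

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier

    # Hypothetical placeholder data: short texts and one 0/1 label per text
    texts = ["first short document about topic A",
             "second short document about topic B"]
    labels = [0, 1]

    vectorizer = TfidfVectorizer(sublinear_tf=False, max_df=0.5,
                                 stop_words='english', max_features=5000)
    X = vectorizer.fit_transform(texts)

    # n_iter -> max_iter and 'auto' -> 'balanced' in current scikit-learn
    clf = SGDClassifier(loss='hinge', alpha=0.0001, max_iter=50,
                        penalty='l2', shuffle=False, class_weight='balanced')
    clf.fit(X, labels)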

The classifier shows good recall for both binary classes (~80%) but poor precision for class-1 (~40%). Since precision matters more than recall in my application, how can I improve precision, even at the cost of slightly lower recall? I am not attached to SGDClassifier; other classifiers are fine.

Best Answer

It seems your system is too liberal: it predicts class-1 too readily. If you can generate an ROC curve, or if your classifier exposes a decision threshold (SGDClassifier does, via decision_function), you can simply make the decision rule more conservative. Assuming class-1 is the "positive" class, raise the threshold for predicting "positive." This necessarily produces more negative predictions and fewer positive ones, which will typically improve precision at the cost of reduced recall.
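
Concretely, decision_function returns a signed score and predict simply cuts it at zero, so you can move that cutoff yourself. A minimal sketch of the idea, assuming clf, X_test and y_test come from a held-out split of the setup in the question (the 0.5 cutoff and 0.7 precision target are purely illustrative values):

    from sklearn.metrics import precision_recall_curve

    # Signed distance from the hyperplane; default prediction is scores > 0
    scores = clf.decision_function(X_test)

    # Raising the cutoff above 0 makes "positive" predictions more
    # conservative: higher precision, lower recall
    threshold = 0.5  # illustrative only; tune on a validation set
    y_pred = (scores > threshold).astype(int)

    # Inspect the whole precision/recall trade-off to pick the threshold
    precision, recall, thresholds = precision_recall_curve(y_test, scores)
    target_precision = 0.7  # hypothetical precision requirement
    ok = precision[:-1] >= target_precision
    chosen = thresholds[ok].min() if ok.any() else None

precision_recall_curve makes the trade-off explicit: you can pick the smallest threshold that meets your precision requirement and then check how much recall you are left with.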