Solved – A multi-label classification for tagging short text

classification, data mining, machine learning, predictive-models, text mining

I am fairly new to text mining and want to practice my skills a little. I have the following task at hand: I have a large list of short texts (~100,000), and every text has on average 3 tags / labels assigned to it. I now want to train a predictive model that can assign labels to new, unseen texts. The assumption about the labels is that all existing labels are included in my dataset.

I started with Python's sklearn and a basic tf-idf matrix representation combined with a one-vs-rest support vector classifier for each label, but the mean F1 score for that model is very low. Are there better ways of creating a predictive model for this kind of task?

As I mentioned before, this area is quite new to me, so a little advice on how to tackle this problem would be more than welcome (e.g. some literature?).

Best Answer

You'll want to familiarize yourself with multi-label classification to better understand the problem you're working on. The review by Tsoumakas et al. is a good place to start. Python's scikit-learn actually has multi-label classification functionality built in, so that might be a good out-of-the-box solution for you!
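To make the scikit-learn route concrete, here is a minimal sketch of a multi-label text tagger. The toy texts, the tag names, and the vectorizer settings (`ngram_range`, `sublinear_tf`) are all made up for illustration and are not from the question. It uses `MultiLabelBinarizer` to turn tag lists into a binary indicator matrix and a one-vs-rest `LinearSVC` on tf-idf features, which is essentially the baseline the question describes, so treat it as a starting point to tune rather than a guaranteed improvement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data: one list of tags per short text.
texts = [
    "how to train a neural network in python",
    "best pasta recipe with tomato sauce",
    "python script for baking time conversion",
    "gradient descent for neural network training",
    "quick tomato soup recipe",
    "debugging a python import error",
]
tags = [
    ["machine-learning", "python"],
    ["cooking"],
    ["python", "cooking"],
    ["machine-learning"],
    ["cooking"],
    ["python"],
]

# Encode the tag lists as a binary matrix: one row per text, one column per tag.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# tf-idf features feeding one binary linear SVM per tag.
# The vectorizer settings are assumptions worth tuning on real data.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    OneVsRestClassifier(LinearSVC()),
)
model.fit(texts, Y)

# Predict an indicator row for a new text and map it back to tag names.
pred = model.predict(["python code for a neural network"])
predicted_tags = mlb.inverse_transform(pred)

# Micro-averaged F1 on the training data, as a sanity check only;
# use a held-out split or cross-validation for a real estimate.
train_f1 = f1_score(Y, model.predict(texts), average="micro")
```

With about 100,000 texts, the choice of F1 averaging (`"micro"`, `"macro"`, or `"samples"`) can change the reported score a lot when many tags are rare, so it is worth checking which one the low score refers to before swapping models.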
