Solved – Text categorization/classification for small scale text

classification, machine learning, svm, text mining

I'm looking into a way to classify/categorize sentences into pre-defined categories (around 10-15). Yes, indeed sentences, not articles or paragraphs.

Given that the articles are not too long on average (2-10 normal pages) and the number of articles (tens to hundreds) is relatively small, it is quite a small-scale problem. However, accuracy is much more important.

Because I am quite new to this field, I started by looking into some introductory papers and a few generic open source projects (e.g. WEKA, GATE and LingPipe). However, what I have found so far are pieces that are too hard to put together to fit my purpose.

What specific algorithm/software/resource do you recommend for this problem?

Thanks in advance.


I did find a tool called TagHelper, which is quite suitable for a few aspects of my purpose, and I am still exploring it.

But I would still welcome more insights and suggestions. Thanks!

Best Answer

In my limited experience, short texts don't make things appreciably harder. Word count data is hopelessly sparse anyway, so vigorous regularisation of some sort is needed in any case. Basically, you've got a document classification problem with really small documents.
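To make the "really small documents" framing concrete, here is a minimal sketch of a sentence classifier on word counts. It is not taken from LingPipe or any of the tools mentioned; it is a plain-Python multinomial naive Bayes, where additive (Laplace) smoothing stands in as one simple form of regularisation. The function names, the tiny training set and the category labels ("method", "result") are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    # Lowercase and split on whitespace; a real tokenizer would do more.
    return sentence.lower().split()


def train_nb(labelled_sentences, alpha=1.0):
    """Fit a multinomial naive Bayes model with additive smoothing alpha."""
    class_counts = Counter()            # sentences seen per category
    word_counts = defaultdict(Counter)  # word counts per category
    vocab = set()
    for text, label in labelled_sentences:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def log_scores(model, sentence):
    """Return the unnormalised log posterior of each category."""
    total = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for label, n in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + alpha * vocab_size
        score = math.log(n / total)  # log prior
        for tok in tokenize(sentence):
            if tok in model["vocab"]:  # ignore words never seen in training
                # Smoothing keeps unseen (word, category) pairs above zero.
                score += math.log((counts[tok] + alpha) / denom)
        scores[label] = score
    return scores


def predict(model, sentence):
    scores = log_scores(model, sentence)
    return max(scores, key=scores.get)


# Hypothetical toy data: four labelled sentences, two categories.
TRAIN = [
    ("the method uses a support vector machine", "method"),
    ("we train the classifier on labelled sentences", "method"),
    ("accuracy improved by five percent", "result"),
    ("the results show higher recall", "result"),
]

model = train_nb(TRAIN)
print(predict(model, "we train a support vector machine"))
```

With so few words per sentence, most (word, category) counts are zero, which is why the smoothing term matters: without it, a single unseen word would rule a category out entirely. A regularised linear model (e.g. a linear SVM or logistic regression) on the same counts would be the more usual choice in practice.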

Since you're already looking at LingPipe, you may find the book Text Analysis with LingPipe helpful. It's a work in progress, but the basics of classification are there. For a more introductory exposition, without an accompanying software package, there is also Manning et al.'s Introduction to Information Retrieval (don't be fooled by the title).

Since 'accuracy is important', the precision, recall, and confusion matrix discussions in these texts will be important to understand. You may find that some categories simply cannot be reliably distinguished, a fact which is important to communicate. You may also find that one of recall and precision is much more important to you than the other, so you'll want to threshold your classifier decisions carefully.
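As a rough illustration of those evaluation ideas, the sketch below builds a confusion matrix from gold and predicted labels, reads per-category precision and recall off it, and applies a confidence threshold to a classifier's decision. The helper names, the labels, the 0.7 threshold and the "unsure" fallback are assumptions chosen for the example, not part of any particular toolkit.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Count how often each (gold label, predicted label) pair occurs."""
    return Counter(zip(gold, predicted))


def precision_recall(matrix, label):
    """Precision and recall for one category, read off the confusion matrix."""
    true_pos = matrix[(label, label)]
    predicted_as = sum(n for (g, p), n in matrix.items() if p == label)
    actually = sum(n for (g, p), n in matrix.items() if g == label)
    precision = true_pos / predicted_as if predicted_as else 0.0
    recall = true_pos / actually if actually else 0.0
    return precision, recall


def decide(probabilities, threshold=0.7, fallback="unsure"):
    """Accept the top category only if its probability clears the threshold."""
    label = max(probabilities, key=probabilities.get)
    return label if probabilities[label] >= threshold else fallback


# Hypothetical gold labels and classifier output for five sentences.
gold = ["method", "method", "result", "result", "other"]
pred = ["method", "result", "result", "result", "other"]

matrix = confusion_matrix(gold, pred)
for category in ("method", "result", "other"):
    print(category, precision_recall(matrix, category))

# A low-confidence decision is routed to the fallback instead of a category.
print(decide({"method": 0.55, "result": 0.45}))
```

Raising the threshold generally trades recall for precision: fewer sentences get a category, but those that do are more likely to be right. The off-diagonal cells of the matrix show which pairs of categories are being confused with each other.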
