Solved – Comprehensive dataset for document classification

classification, text mining

In the field of document classification (document categorization), researchers usually recommend a few standard datasets, such as Reuters-21578 and RCV1 (Reuters Corpus Volume 1). However, these datasets only contain documents from a specific domain, for instance newswire stories.

However, what if we really want to classify arbitrary documents (say 1 million of them) across a wider range of topics, such as literature, sports, politics, and science? How can we accomplish this task?

If classification (supervised learning) doesn't work, could anyone suggest some more advanced methods to automatically categorize documents on arbitrary topics?

Best Answer

If classification (supervised learning) doesn't work, could anyone suggest some more advanced methods to automatically categorize documents on arbitrary topics?

There's also clustering (unsupervised learning). It tries to find distinct differences between your articles. The only problem is that you can't directly tell what each group represents, since the grouping is based only on a distance measure (the Euclidean norm or another one). But that can be helpful too: some articles that go into the "sport" category may turn out to be very similar to the "leisure" articles. Looking at the data that way may help you redefine your groups.
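To make the clustering idea concrete, here is a minimal sketch in plain Python, assuming whitespace tokenisation, TF-IDF weighting, and k-means with squared Euclidean distance. The function names (tfidf_vectors, kmeans, top_terms) and the four toy documents are hypothetical and only for illustration; for 1 million documents you would want a proper library rather than this code. Printing the heaviest terms of each centroid is one possible way to guess what a group is about.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn whitespace-tokenised documents into L2-normalised TF-IDF dicts."""
    tokenised = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenised:
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in Counter(tokens).items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors (dicts)."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans(vectors, k, iterations=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels, centroids


def top_terms(centroid, n=3):
    """Heaviest terms of a centroid, a rough hint at what the cluster is about."""
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical toy corpus; real input would be the unlabeled articles.
docs = [
    "football team wins league match",
    "striker scores twice during football match",
    "parliament passes new election law",
    "minister loses election vote in parliament",
]
vectors = tfidf_vectors(docs)
labels, centroids = kmeans(vectors, k=2)
for cluster in sorted(set(labels)):
    print(cluster, top_terms(centroids[cluster], 2))
```

The cluster numbers themselves carry no meaning, which is exactly the limitation described above; only the terms attached to each centroid suggest a name for the group.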

However, what if we really want to classify arbitrary documents (say 1 million of them) across a wider range of topics, such as literature, sports, politics, and science? How can we accomplish this task?

If you want the computer to do this automatically by looking at the articles, then classification is the only way (that I know of). The fact that it doesn't seem to work correctly for you doesn't necessarily mean the approach is statistically meaningless. I highly doubt that no distinct differences can be found in 1 million articles, so again, maybe you're creating too many groups?
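For comparison, the supervised route can be sketched as a tiny multinomial Naive Bayes classifier with add-one smoothing. This is only an illustration under assumptions of my own: the class name NaiveBayesTextClassifier, the three broad labels, and the six training sentences are all made up, and a real setup would need many labeled documents per topic.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        # Terms never seen in training are ignored.
        tokens = [t for t in doc.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled examples with a few broad topics.
train_docs = [
    "football team wins league match",
    "striker scores twice during football match",
    "parliament passes new election law",
    "minister loses election vote in parliament",
    "telescope reveals distant galaxy",
    "physicists measure particle mass",
]
train_labels = ["sports", "sports", "politics", "politics", "science", "science"]

model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(model.predict("election results announced in parliament"))
print(model.predict("the football match"))
```

Keeping the label set small and broad, as here, is in line with the suggestion above that too many groups may be part of the problem.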

If you're looking for another text corpus, you could try the prebuilt datasets from RTextTools. There's a set called NYTimes (again, news articles) and one called USCongress (something different: a sample dataset containing labeled bills from the United States Congress).

Your question looks a bit general to me, so maybe I misunderstood it. Correct me if I'm wrong.