I am experimenting with classification algorithms in ML and am looking for a corpus to train my model to distinguish among different text categories such as sports, weather, technology, football, cricket, etc.
Where can I find a dataset with these categories?
An option would be to crawl Wikipedia for these 30+ categories. Is there a better way to do this?
Edit: I want to train the model using the bag-of-words approach for these categories, then classify new/unknown websites into these predefined categories based on the content of each webpage.
Best Answer
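The scikit-learn 20 newsgroups text dataset has 11314 train + 7532 test samples; vectorized as a bag of words, it has 10,000 or more sparse features. The newsgroup categories are:
    alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware,
    comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles,
    rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space,
    soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc,
    talk.religion.misc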
For subsets of 2, 3, 4 and 5 of these newsgroups (the worst ones) I get 83.2 %, 82.6 %, 82.2 % and 80.6 % correct respectively, using the fast SGD classifier.
(The first run of fetch_20newsgroups will take a while to download and cache the data.)
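
For reference, here is a minimal sketch of how such a run could look. It is not the exact script behind the numbers above: the choice of TfidfVectorizer, the stop-word setting, and the helper names build_classifier and score_on_20newsgroups are my own assumptions for illustration, so the accuracies you get will likely differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def build_classifier():
    """Bag-of-words features (TF-IDF weighted) feeding a linear SGD classifier.

    The vectorizer and its settings are illustrative assumptions,
    not necessarily what produced the accuracies quoted above.
    """
    return make_pipeline(
        TfidfVectorizer(stop_words="english"),
        SGDClassifier(random_state=0),
    )


def score_on_20newsgroups(categories=None):
    """Train on the 'train' split and return accuracy on the 'test' split.

    categories=None uses all 20 newsgroups. The first call downloads
    and caches the dataset, which can take a while.
    """
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)
    clf = build_classifier().fit(train.data, train.target)
    return clf.score(test.data, test.target)
```

For example, score_on_20newsgroups(["rec.sport.baseball", "rec.sport.hockey"]) would score a two-category run. To classify a new webpage, fit the pipeline once and call clf.predict([page_text]) on the page's extracted text; the predicted index maps back to a category name through the dataset's target_names.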