Solved – corpus specifically for categories like sports, entertainment, or health

categorical data, classification, dataset, machine learning, text mining

I am experimenting with classification algorithms in ML and am looking for a corpus to train my model to distinguish among text categories such as sports, weather, technology, football, cricket, etc.

Where can I find a dataset with these categories?

An option would be to crawl Wikipedia for these 30+ categories. Is there a better way to do this?

Edit: I want to train the model on these categories using the bag-of-words approach, then classify new, unseen websites into these predefined categories based on the content of each webpage.

Best Answer

The scikit-learn 20 newsgroups text dataset has 11314 training and 7532 test samples, with 10,000 or more sparse features. The newsgroup categories are:

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

For subsets of 2, 3, 4, and 5 of these newsgroups (the worst-performing ones), I get 83.2%, 82.6%, 82.2%, and 80.6% correct, respectively, using the fast SGD classifier.

(The first run of fetch_20newsgroups will take a while to download and cache the data.)
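A minimal sketch of how such an experiment could be set up, not the exact code behind the numbers above. It assumes TF-IDF-weighted bag-of-words features and `SGDClassifier` with default settings; the helper names `build_model`, `train_and_score`, and `run_on_newsgroups` are made up for illustration, and stripping headers, footers, and quotes is also an assumption.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def build_model():
    # Bag-of-words features (TF-IDF weighted) feeding a linear model
    # trained with stochastic gradient descent.
    return make_pipeline(
        TfidfVectorizer(stop_words="english"),
        SGDClassifier(random_state=0),
    )


def train_and_score(train_texts, train_labels, test_texts, test_labels):
    # Fit on the training texts and return the fitted model together
    # with its accuracy (fraction correct) on the test texts.
    model = build_model()
    model.fit(train_texts, train_labels)
    return model, model.score(test_texts, test_labels)


def run_on_newsgroups(categories):
    # Hypothetical helper: downloads (and caches) the data on first use.
    # Removing headers, footers, and quotes is an assumption made here so
    # the classifier sees only the message body.
    remove = ("headers", "footers", "quotes")
    train = fetch_20newsgroups(subset="train", categories=categories, remove=remove)
    test = fetch_20newsgroups(subset="test", categories=categories, remove=remove)
    return train_and_score(train.data, train.target, test.data, test.target)
```

Calling, for example, `run_on_newsgroups(["rec.sport.baseball", "rec.sport.hockey", "sci.med"])` returns the fitted model and its test accuracy. To classify a new webpage, extract its text and pass it to `model.predict([page_text])`; the accuracy will depend on the categories chosen and the preprocessing.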
