Machine Learning – Comprehensive Guide Using Python

machine learning, python

I am considering using Python libraries for my machine learning experiments. So far I have relied on WEKA, but I have been fairly dissatisfied with it overall. This is primarily because I have found WEKA to be poorly supported (few examples, sparse documentation and, in my experience, weak community support), and I have ended up in sticky situations with no help forthcoming. Another reason I am contemplating this move is that I am really enjoying Python (I am new to it) and don't want to go back to coding in Java.

So my question is, what are the more

  1. comprehensive
  2. scalable (100k features, 10k examples) and
  3. well supported libraries for doing ML in Python out there?

I am particularly interested in text classification, so I would like a library that has a good collection of classifiers, feature selection methods (Information Gain, Chi-Squared etc.), and text pre-processing capabilities (stemming, stopword removal, tf-idf etc.).

Based on past e-mail threads here and elsewhere, I have been looking at PyML, scikit-learn and Orange so far. What have people's experiences been with these with respect to the three criteria above?

Any other suggestions?

Best Answer

About the scikit-learn option: 100k (sparse) features and 10k samples is small enough to fit in memory, hence perfectly doable with scikit-learn (same size as the 20 newsgroups dataset).
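
To give a feel for what this looks like in practice, here is a minimal sketch of a tf-idf text classification pipeline. The tiny corpus, the labels and the choice of `LinearSVC` are made up for illustration; a real experiment would load something like the 20 newsgroups data instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus, invented for this example only.
docs = [
    "the striker scored a goal in the final match",
    "the goalkeeper saved a penalty during the match",
    "the team won the league after a late goal",
    "the new laptop ships with a faster processor",
    "the compiler update improves memory usage",
    "the processor benchmark shows lower memory latency",
]
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

# Vectorizing on its own returns a SciPy sparse matrix, which is why
# very wide feature spaces still fit comfortably in memory.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
print(X.shape, "stored non-zeros:", X.nnz)

# Chain vectorizer and classifier so raw strings go in and labels come out.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(docs, labels)

predicted = model.predict(["the goalkeeper missed the match"])
print(predicted)
```

Wrapping both steps in one pipeline means the same vocabulary and idf weights learned at training time are applied to new documents.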

Here is a tutorial I gave at PyCon 2011 with a chapter on text classification with exercises and solutions:

I also gave a talk on the topic, which is an updated version of the one I gave at PyCon FR. Here are the slides (with the video embedded in the comments):

As for feature selection, have a look at this answer on Quora, where all the examples are based on the scikit-learn documentation:
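
As a rough illustration of chi-squared feature selection (this is my own sketch, not taken from the linked answer), the snippet below keeps the 5 terms most associated with the class labels. The corpus, labels and the value `k=5` are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus, invented for this example only.
docs = [
    "the striker scored a goal in the final match",
    "the goalkeeper saved a penalty during the match",
    "the team won the league after a late goal",
    "the new laptop ships with a faster processor",
    "the compiler update improves memory usage",
    "the processor benchmark shows lower memory latency",
]
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

# chi2 expects non-negative features, so raw term counts are fine.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Score each term against the labels and keep the k best.
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)

# Map the boolean support mask back to the terms (column order
# follows the indices stored in vocabulary_).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
kept = [term for term, keep in zip(terms, selector.get_support()) if keep]

print(X.shape, "->", X_selected.shape)
print(kept)
```

In a real experiment the selector would normally sit inside a pipeline between the vectorizer and the classifier, so that the selection is refit on each training fold.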

We don't have collocation feature extraction in scikit-learn yet. In the meantime, use nltk and nltk-trainer for this:
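
A minimal sketch of bigram collocation extraction with nltk alone (nltk-trainer is not used here). The sample text, the whitespace tokenization and the frequency threshold of 2 are assumptions made for illustration.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Sample text, invented for this example only.
text = (
    "machine learning is fun and machine learning with python is practical "
    "because python makes machine learning approachable"
)

# Plain whitespace tokenization keeps the example free of corpus downloads.
tokens = text.lower().split()

finder = BigramCollocationFinder.from_words(tokens)

# Ignore bigrams seen fewer than 2 times; PMI is unreliable for rare pairs.
finder.apply_freq_filter(2)

# Rank the remaining bigrams by pointwise mutual information.
bigram_measures = BigramAssocMeasures()
top_bigrams = finder.nbest(bigram_measures.pmi, 3)
print(top_bigrams)
```

The resulting bigrams could then be added as extra tokens before vectorizing, so that a downstream scikit-learn classifier sees them as features.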