Machine Learning – Comprehensive Guide Using Python

I am considering using Python libraries for my Machine Learning experiments. So far I have been relying on WEKA, but I have been pretty dissatisfied on the whole. This is primarily because I have found WEKA to be poorly supported (very few examples, sparse documentation, and community support that has been less than desirable in my experience), and I have found myself in sticky situations with no help forthcoming. Another reason I am contemplating this move is that I really like Python (I am new to it) and don't want to go back to coding in Java.

So my question is: what are the more comprehensive, scalable (100k features, 10k examples), and well-supported libraries for doing ML in Python out there?

I am particularly interested in doing text classification, and so would like to use a library that has a good collection of classifiers, feature selection methods (Information Gain, Chi-Squared etc.), and text pre-processing capabilities (stemming, stopword removal, tf-idf etc.).

Based on past e-mail threads here and elsewhere, I have been looking at PyML, scikits-learn and Orange so far. What have people's experiences been with respect to the three criteria mentioned above?

Any other suggestions?

Best Answer

About the scikit-learn option: 100k (sparse) features and 10k samples is small enough to fit in memory, hence perfectly doable with scikit-learn (it is about the same size as the 20 newsgroups dataset).
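
To illustrate the sparse representation, here is a minimal sketch (not taken from the tutorial below). The toy documents and labels are made up for illustration, and LinearSVC is just one reasonable choice of linear classifier; a real experiment would load thousands of documents instead.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for a real text collection.
docs = [
    "the goalie stopped the puck on the ice",
    "hockey players skate fast in overtime",
    "nasa launched a rocket into orbit",
    "the shuttle docked with the space station",
]
labels = ["hockey", "hockey", "space", "space"]

# Stop word removal and tf-idf weighting happen in one step.
# The result is a SciPy sparse matrix, so only non-zero entries are stored.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
print(X.shape, issparse(X))

# Linear models train directly on the sparse matrix.
clf = LinearSVC()
clf.fit(X, labels)

# New text must go through the same fitted vectorizer before prediction.
predicted = clf.predict(vectorizer.transform(["a rocket reached orbit"]))
print(predicted[0])
```

Because the matrix stays sparse end to end, memory use grows with the number of non-zero term weights rather than with the full samples-by-features grid.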

Here is a tutorial I gave at PyCon 2011 with a chapter on text classification with exercises and solutions:

http://scikit-learn.github.com/scikit-learn-tutorial/ (online HTML version)
https://github.com/downloads/scikit-learn/scikit-learn-tutorial/scikit_learn_tutorial.pdf (PDF version)
https://github.com/scikit-learn/scikit-learn-tutorial (source code + exercises)

I also gave a talk on the topic, which is an updated version of the one I gave at PyCon FR. Here are the slides (the video is embedded in the comments):

http://www.slideshare.net/ogrisel/statistical-machine-learning-for-text-classification-with-scikitlearn-and-nltk

As for feature selection, have a look at this answer on Quora, where all the examples are based on the scikit-learn documentation:

http://www.quora.com/What-are-some-feature-selection-methods/answer/Olivier-Grisel
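
For a rough idea of what chi-squared feature selection looks like in scikit-learn, here is a minimal sketch (not copied from the linked answer). The toy corpus and the choice of k=4 are arbitrary assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus; the labels are made up for illustration.
docs = [
    "the goalie stopped the puck on the ice",
    "hockey players skate fast in overtime",
    "nasa launched a rocket into orbit",
    "the shuttle docked with the space station",
]
labels = ["hockey", "hockey", "space", "space"]

# chi2 expects non-negative features such as raw term counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Keep the k terms with the highest chi-squared score against the labels.
selector = SelectKBest(chi2, k=4)
X_reduced = selector.fit_transform(X, labels)

# Map the selected columns back to terms through the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
kept = [term for term, keep in zip(terms, selector.get_support()) if keep]
print(X.shape, "->", X_reduced.shape)
print(kept)
```

The reduced matrix can be passed to any classifier in place of the full one; on a real corpus k would typically be tuned by cross-validation.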

We don't have collocation feature extraction in scikit-learn yet. Use nltk and nltk-trainer to do this in the meantime:

https://github.com/japerk/nltk-trainer
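
As a minimal sketch of collocation extraction with plain nltk (this does not use nltk-trainer, and the sample sentence, frequency threshold and PMI scoring are assumptions chosen for illustration):

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Hypothetical sample text, tokenized naively on whitespace
# so that no nltk data download is needed.
text = (
    "machine learning with python makes machine learning "
    "experiments fast and machine learning fun"
)
tokens = text.split()

# Count all adjacent word pairs, then drop pairs seen fewer than twice.
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)

# Rank the remaining bigrams by pointwise mutual information.
top = finder.nbest(BigramAssocMeasures.pmi, 5)
print(top)
```

The resulting bigrams could then be added as extra tokens or features alongside the single-word features before vectorization.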
