Solved – Pandas / Statsmodels / Scikit-learn

machine learning, pandas, python, scikit-learn, statsmodels

  1. Are Pandas, Statsmodels and Scikit-learn different implementations of machine learning/statistical operations, or are these complementary to one another?

  2. Which of these has the most comprehensive functionality?

  3. Which one is actively developed and/or supported?

  4. I have to implement logistic regression. Any suggestions as to which of these I should use?

Best Answer

  1. Scikit-learn (sklearn) is the best choice for machine learning out of the three listed. While Pandas and Statsmodels do contain some predictive learning algorithms, these are hidden or not yet production-ready. Because authors often work across several of these projects, the libraries are complementary. For example, Pandas' DataFrames were recently integrated into Statsmodels. No comparable relationship between sklearn and Pandas exists (yet).

  2. Define functionality. They all run. If you mean which is the most useful, then it depends on your application. I would definitely give Pandas a +1 here, as it has added a great new data structure to Python (the DataFrame). Pandas also probably has the best API.

  3. They are all actively supported, though I would say Pandas has the best code base. Sklearn and Pandas are more active than Statsmodels.

  4. The clear choice is sklearn. Fitting a logistic regression with it is short and straightforward:

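    from sklearn.linear_model import LogisticRegression as LR

    # X holds the training features and Y the training labels (both assumed to exist).
    logr = LR()
    logr.fit(X, Y)
    # Predict class labels for the unseen rows in test_data.
    results = logr.predict(test_data)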
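
The snippet above assumes that `X`, `Y` and `test_data` already exist. For a version that can be run as-is, here is a minimal sketch on synthetic data. The toy dataset, the 150/50 split and the variable names are assumptions made for illustration and are not part of the original answer; the estimator is left at its default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature dataset (illustrative assumption, not real data):
# the label is 1 when the two features sum to a positive number.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out the last 50 rows as test data.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)          # hard class labels (0 or 1)
probabilities = model.predict_proba(X_test)  # per-class probabilities, one row per sample
accuracy = model.score(X_test, y_test)       # fraction of correct predictions
print(f"test accuracy: {accuracy:.2f}")
```

`predict` returns class labels, while `predict_proba` returns the estimated probability of each class, which is often what you want from a logistic regression.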