Regression – Difference Between Statsmodels OLS and Scikit-Learn Linear Regression

python · regression · scikit-learn · statsmodels

I have a question about two methods from different libraries which seem to do the same job. I am trying to build a linear regression model.

Here is the code where I use the statsmodels library with OLS:

# assuming x and y already hold the features and target
from sklearn.model_selection import train_test_split  # cross_validation was renamed to model_selection
import statsmodels.api as sm

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

x_train = sm.add_constant(X_train)  # statsmodels does not add an intercept automatically
model = sm.OLS(y_train, x_train)
results = model.fit()

print("GFT + Wiki / GT  R-squared", results.rsquared)

This prints out: GFT + Wiki / GT R-squared 0.981434611923

The second one uses the scikit-learn library's linear model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

print('GFT + Wiki / GT R-squared: %.4f' % model.score(X_test, y_test))

This prints out: GFT + Wiki / GT R-squared: 0.8543

So my question is: both methods print an R^2 result, but one prints out 0.98 and the other 0.85.

From my understanding, OLS works with the training dataset. So my questions are:

  • Is there a way to work with the test dataset in OLS?
  • Does the training-set score tell us anything meaningful (in OLS we didn't use the test dataset)? From my past knowledge, we have to work with test data.
  • What is the difference between OLS and scikit-learn's linear regression? Which one should we use for calculating the score of the model?

Thanks for any help.

Best Answer

First, in terms of usage: you can get predictions in statsmodels in much the same way as in scikit-learn, except that we use the results instance returned by fit. One caveat: because the model was fit with an added constant, the same constant column has to be added to the test data:

predictions = results.predict(sm.add_constant(X_test))

Given the predictions, we can calculate statistics that are based on the prediction error:

prediction_error = y_test - predictions

There is a separate collection of functions for calculating goodness-of-prediction statistics, but it is not integrated into the models, nor does it include R squared. (I've never heard of R squared being used for out-of-sample data.) Calculating those requires a bit more work by the user, and statsmodels does not have the same set of statistics, especially not for classification or models with a binary response variable.
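As a rough sketch of that extra work, reusing y_test, predictions and prediction_error from above (and assuming they are NumPy arrays): the helper module statsmodels.tools.eval_measures provides common error metrics, and the number scikit-learn reports from model.score(X_test, y_test) can be reproduced by hand.

import numpy as np
from statsmodels.tools import eval_measures

# Goodness-of-prediction statistic from statsmodels' helper module
rmse = eval_measures.rmse(y_test, predictions)

# "Out-of-sample R squared" computed by hand: this is what
# model.score(X_test, y_test) returns in scikit-learn, which is why
# the 0.85 (out-of-sample) differs from results.rsquared = 0.98 (in-sample)
ss_res = np.sum(prediction_error ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_oos = 1 - ss_res / ss_tot

print(rmse, r2_oos)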

To your other two points:

Linear regression is, in its basic form, the same in statsmodels and in scikit-learn. However, the implementations differ, which might produce different results in edge cases, and scikit-learn has in general more support for larger models. For example, statsmodels currently uses sparse matrices in very few parts.

The most important difference is in the surrounding infrastructure and the use cases that are directly supported.

Statsmodels largely follows the traditional model where we want to know how well a given model fits the data, and what variables "explain" or affect the outcome, or what the size of the effect is. Scikit-learn follows the machine learning tradition where the main supported task is choosing the "best" model for prediction.

As a consequence, the emphasis in the supporting features of statsmodels is on analysing the training data, which includes hypothesis tests and goodness-of-fit measures, while the emphasis in the supporting infrastructure of scikit-learn is on model selection for out-of-sample prediction, and therefore on cross-validation with "test data".
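To make that contrast concrete, here is a minimal sketch of the two workflows, assuming the results instance and the training arrays from the question:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# statsmodels emphasis: inference on the fitted model
print(results.summary())   # coefficients, standard errors, t-tests, R-squared
print(results.pvalues)     # p-values for each coefficient's hypothesis test

# scikit-learn emphasis: model selection via out-of-sample performance
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=5)
print(scores.mean())       # average cross-validated R-squared across folds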

This points out the distinction, but there is still quite a lot of overlap in usage. Statsmodels also does prediction, and additionally forecasting in a time-series context. However, when we want to do cross-validation for prediction in statsmodels, it is currently still often easier to reuse the cross-validation setup of scikit-learn together with the estimation models of statsmodels, as in the sketch below.
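For example, a minimal sketch under the assumption that x and y are the arrays from the question, driving statsmodels' OLS with scikit-learn's KFold splitter:

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold

X, Y = np.asarray(x), np.asarray(y)
fold_rmse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    # Refit the statsmodels OLS on each training fold
    res = sm.OLS(Y[train_idx], sm.add_constant(X[train_idx])).fit()
    # Score the held-out fold, adding the same constant column
    pred = res.predict(sm.add_constant(X[test_idx]))
    fold_rmse.append(np.sqrt(np.mean((Y[test_idx] - pred) ** 2)))

print(np.mean(fold_rmse))  # average out-of-sample RMSE across folds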