Regression – Difference Between Statsmodels OLS and Scikit-Learn Linear Regression

python · regression · scikit-learn · statsmodels

I have a question about two methods from different libraries which seem to do the same job. I am trying to build a linear regression model.

Here is the code where I use the statsmodels library with OLS:

# assuming x and y already hold the features and target
from sklearn.model_selection import train_test_split  # cross_validation was renamed to model_selection
import statsmodels.api as sm

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

x_train = sm.add_constant(X_train)  # statsmodels does not add an intercept automatically
model = sm.OLS(y_train, x_train)
results = model.fit()

print("GFT + Wiki / GT  R-squared", results.rsquared)

This prints out: GFT + Wiki / GT R-squared 0.981434611923

The second one uses the scikit-learn library's linear model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

print('GFT + Wiki / GT R-squared: %.4f' % model.score(X_test, y_test))

This prints out: GFT + Wiki / GT R-squared: 0.8543

So my question is: both methods print an R^2 result, but one prints out 0.98 and the other 0.85.

From my understanding, OLS works with the training dataset. So my questions are:

  • Is there a way to work with the test dataset in OLS?
  • Does the training-set score tell us anything meaningful (in OLS we didn't use the test dataset)? From my past knowledge, we have to work with test data.
  • What is the difference between OLS and scikit-learn's linear regression? Which one should we use for calculating the score of the model?

Thanks for any help.

Best Answer

First, in terms of usage: you can get predictions in statsmodels in much the same way as in scikit-learn, except that we use the results instance returned by fit. One caveat: because the model was fit with an added constant, the same constant column has to be added to the test data:

predictions = results.predict(sm.add_constant(X_test))

Given the predictions, we can calculate statistics that are based on the prediction error:

prediction_error = y_test - predictions

There is a separate collection of functions for calculating goodness-of-prediction statistics, but it is not integrated into the models, nor does it include R squared. (I've never heard of R squared being used for out-of-sample data.) Calculating those requires a bit more work by the user, and statsmodels does not have the same set of statistics, especially not for classification or models with a binary response variable.
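As a rough sketch of that extra work, reusing y_test, predictions and prediction_error from above (and assuming they are NumPy arrays): the helper module statsmodels.tools.eval_measures provides common error metrics, and the number scikit-learn reports from model.score(X_test, y_test) can be reproduced by hand.

import numpy as np
from statsmodels.tools import eval_measures

# Goodness-of-prediction statistic from statsmodels' helper module
rmse = eval_measures.rmse(y_test, predictions)

# "Out-of-sample R squared" computed by hand: this is what
# model.score(X_test, y_test) returns in scikit-learn, which is why
# the 0.85 (out-of-sample) differs from results.rsquared = 0.98 (in-sample)
ss_res = np.sum(prediction_error ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_oos = 1 - ss_res / ss_tot

print(rmse, r2_oos)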

To your other two points:

Linear regression is, in its basic form, the same in statsmodels and in scikit-learn. However, the implementations differ, which might produce different results in edge cases, and scikit-learn has in general more support for larger models. For example, statsmodels currently uses sparse matrices in very few parts.

The most important difference is in the surrounding infrastructure and the use cases that are directly supported.

Statsmodels largely follows the traditional model where we want to know how well a given model fits the data, and what variables "explain" or affect the outcome, or what the size of the effect is. Scikit-learn follows the machine learning tradition where the main supported task is choosing the "best" model for prediction.

As a consequence, the emphasis in the supporting features of statsmodels is on analysing the training data, which includes hypothesis tests and goodness-of-fit measures, while the emphasis in the supporting infrastructure of scikit-learn is on model selection for out-of-sample prediction, and therefore on cross-validation with "test data".
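To make that contrast concrete, here is a minimal sketch of the two workflows, assuming the results instance and the training arrays from the question:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# statsmodels emphasis: inference on the fitted model
print(results.summary())   # coefficients, standard errors, t-tests, R-squared
print(results.pvalues)     # p-values for each coefficient's hypothesis test

# scikit-learn emphasis: model selection via out-of-sample performance
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=5)
print(scores.mean())       # average cross-validated R-squared across folds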

This points out the distinction, but there is still quite a lot of overlap in usage. Statsmodels also does prediction, and additionally forecasting in a time-series context. However, when we want to do cross-validation for prediction in statsmodels, it is currently still often easier to reuse the cross-validation setup of scikit-learn together with the estimation models of statsmodels, as in the sketch below.
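For example, a minimal sketch under the assumption that x and y are the arrays from the question, driving statsmodels' OLS with scikit-learn's KFold splitter:

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold

X, Y = np.asarray(x), np.asarray(y)
fold_rmse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    # Refit the statsmodels OLS on each training fold
    res = sm.OLS(Y[train_idx], sm.add_constant(X[train_idx])).fit()
    # Score the held-out fold, adding the same constant column
    pred = res.predict(sm.add_constant(X[test_idx]))
    fold_rmse.append(np.sqrt(np.mean((Y[test_idx] - pred) ** 2)))

print(np.mean(fold_rmse))  # average out-of-sample RMSE across folds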