Solved – Wildly different $R^2$ between linear regression in statsmodels and sklearn

python, regression, scikit-learn, statsmodels

My question is related to:

Difference between statsmodel OLS and scikit linear regression

I essentially have the same problem, except my results differ even more substantially. Performing the following simple linear regression, I get drastically different values for the coefficient of determination:

import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model

x1 = [26.0, 31.0, 47.0, 51.0, 50.0, 49.0, 37.0, 33.0, 49.0, 54.0, 31.0, 49.0, 48.0, 49.0, 49.0, 47.0, 44.0, 48.0, 35.0, 43.0]
y1 = [116.0, 94.0, 100.0, 102.0, 116.0, 116.0, 68.0, 118.0, 91.0, 104.0, 78.0, 116.0, 90.0, 109.0, 116.0, 118.0, 108.0, 119.0, 110.0, 102.0]

# Fit and summarize statsmodels OLS model
model_sm = sm.OLS(x1, y1)
result_sm = model_sm.fit()
print(result_sm.summary())


# Create sklearn linear regression object
ols_sk = linear_model.LinearRegression(fit_intercept=True)

# Fit model
model_sk = ols_sk.fit(pd.DataFrame(x1), pd.DataFrame(y1))

# sklearn coefficient of determination
coefofdet = model_sk.score(pd.DataFrame(x1), pd.DataFrame(y1))

print('sklearn R^2: ' + str(coefofdet))

Statsmodels gives me an $R^2$ of 0.962, while sklearn gives me an $R^2$ of 0.0584069073664.

What is causing such a drastic difference?

Best Answer

In your scikit-learn model, you included an intercept by passing fit_intercept=True, so the model fits both an intercept and a slope.

In statsmodels, if you want to include an intercept, you need to add a column of ones to the design matrix yourself by running x1 = sm.add_constant(x1). With the intercept included, sm.OLS() yields an R-squared of around 0.058, matching scikit-learn's result.
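As a minimal sketch of what add_constant does (it simply prepends a column of ones to the data):

import statsmodels.api as sm

X = sm.add_constant([26.0, 31.0, 47.0])
print(X)
# [[ 1. 26.]
#  [ 1. 31.]
#  [ 1. 47.]]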

It's also important to note that statsmodels takes the arguments in the opposite order from scikit-learn: sm.OLS(y1, x1) puts the response first and the predictor second, whereas scikit-learn's fit(X, y) puts the predictors first. So you want y1 first and x1 second, not x1, then y1.
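Putting both fixes together, a minimal sketch of the corrected statsmodels fit for the data above would be:

import statsmodels.api as sm

# Response (endog) first, predictor (exog) second,
# with an explicit intercept column
X = sm.add_constant(x1)
result = sm.OLS(y1, X).fit()
print(result.rsquared)  # roughly 0.058, matching scikit-learn's score()

With the intercept column added and the argument order corrected, the two libraries report the same coefficient of determination.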