Solved – Same model coefficients, different R^2 with statsmodels OLS and scikit-learn LinearRegression

machine-learning, regression, scikit-learn, statsmodels

I'm getting to know Python regression tools with the intention of benchmarking them against ML tools available on a couple of cloud-based services. I'm using the Boston housing dataset distributed with scikit-learn, and am testing with both statsmodels OLS and scikit-learn's LinearRegression. The two models are identical: no intercept (not clear why this fits better, but that's what I'm seeing), the same two IVs plus an interaction term, and the same DV. And the two models give the same beta coefficients on the independent variables. The only difference I'm noticing is in the R^2 values.
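For reference, bos and target come from the Boston data shipped with scikit-learn; the setup is roughly as follows (the load_boston call and the RMxLSTAT column, which I take to be the product of the two IVs, are a reconstruction rather than the post's exact code; note load_boston has since been removed in scikit-learn 1.2, so this needs an older version):

import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

boston = load_boston()
bos = pd.DataFrame(boston.data, columns=boston.feature_names)
bos['RMxLSTAT'] = bos['RM'] * bos['LSTAT']   # interaction term
bos['MEDV'] = boston.target                  # used by the sklearn block below
target = pd.DataFrame(boston.target, columns=['MEDV'])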

#statsmodels
import statsmodels.api as sm

X = bos[['RM', 'LSTAT', 'RMxLSTAT']]
y = target['MEDV']
model = sm.OLS(y, X).fit()   # no constant column added, so no intercept
predictions = model.predict(X)
model.summary()

Gives

  • RM 5.3906
  • LSTAT 0.9631
  • RMxLSTAT -0.306
  • R-squared 0.957

#scikit-learn
from sklearn.linear_model import LinearRegression

X = bos[['RM', 'LSTAT', 'RMxLSTAT']]
y = bos[['MEDV']]
bos_linreg = LinearRegression(fit_intercept=False)  # match the no-intercept OLS fit
bos_linreg.fit(X, y)
print(f'Coefficients: {bos_linreg.coef_}')
print(f'Intercept: {bos_linreg.intercept_}')
print(f'R^2 score: {bos_linreg.score(X, y)}')

Gives

  • Coefficients: [[ 5.39059216 0.96309233 -0.30632371]]
  • Intercept: 0.0
  • R^2 score: 0.7009604508111584

I've seen similar questions posted, but haven't seen an answer that applies. What am I missing?

Thanks, community.


Best Answer

A reproducible example is always appreciated.

There are various definitions of R^2, and the standard ones all give identical results when the model has an intercept. When there is no intercept they diverge, and the two libraries pick different ones: statsmodels' rsquared switches to the uncentered total sum of squares when the model has no constant, while sklearn's score always uses the centered total sum of squares. You can see this below:
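Spelled out, with SSR = sum((y - y_hat)^2) the residual sum of squares, the two conventions that matter here are (standard definitions, stated for clarity):

R^2_centered   = 1 - SSR / sum((y - mean(y))^2)    # what sklearn's score always reports
R^2_uncentered = 1 - SSR / sum(y^2)                # what statsmodels reports when there is no constant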

import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from numpy.testing import assert_allclose

n = 100

for intercept in False, True:

    X = np.random.normal(size=(n, 3))
    if intercept:
        print("\nWith intercept:")
        X[:, 0] = 1  # turn the first column into a constant
    else:
        print("Without intercept:")
    y = 2 + X[:, 0] - X[:, 1] + np.random.normal(size=n)
    sm_model = sm.OLS(y, X).fit()
    sm_predictions = sm_model.predict(X)
    print("Statsmodels:      ", sm_model.rsquared)
    print("Corr(y, y_hat)^2: ", np.corrcoef(y, sm_predictions)[0, 1]**2)
    centered_resid = sm_model.resid - sm_model.resid.mean()
    centered_y = y - y.mean()
    # Candidate definitions: (un)centered residual sum of squares over
    # (un)centered total sum of squares.
    print("1 - ssr_u/tss_c=  ", 1 - np.sum(sm_model.resid**2) / np.sum(centered_y**2))
    print("1 - ssr_c/tss_c=  ", 1 - np.sum(centered_resid**2) / np.sum(centered_y**2))
    print("1 - ssr_u/tss_u=  ", 1 - np.sum(sm_model.resid**2) / np.sum(y**2))

    sk_model = LinearRegression(fit_intercept=False)
    sk_model.fit(X, y)
    print("sklearn:          ", sk_model.score(X, y))

    # The coefficients agree exactly; only the R^2 conventions differ.
    assert_allclose(sm_model.params, sk_model.coef_)

Here is the result from one run (no seed is set, so exact values will vary):

Without intercept:
Statsmodels:       0.14261994999123073
Corr(y, y_hat)^2:  0.43150433543477057
1 - ssr_u/tss_c=   -1.266391530932561
1 - ssr_c/tss_c=   0.4296360038850652
1 - ssr_u/tss_u=   0.14261994999123062
sklearn:           -1.266391530932561

With intercept:
Statsmodels:       0.5664929089923038
Corr(y, y_hat)^2:  0.5664929089923038
1 - ssr_u/tss_c=   0.5664929089923038
1 - ssr_c/tss_c=   0.5664929089923039
1 - ssr_u/tss_u=   0.9268815196720518
sklearn:           0.5664929089923038
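
So in the no-intercept case, statsmodels is reporting 1 - ssr_u/tss_u while sklearn's score is reporting 1 - ssr_u/tss_c; that is exactly the 0.957 vs. 0.700 gap in the question. If you want the statsmodels-style number from an sklearn fit, you can compute it by hand (a small sketch reusing sk_model, X, and y from above):

y_hat = sk_model.predict(X)
ssr = np.sum((y - y_hat)**2)
print(1 - ssr / np.sum(y**2))   # matches sm_model.rsquared for the no-constant model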