I'm getting to know Python regression tools with the intention of benchmarking against ML tools available on a couple of cloud-based services. I'm using the Boston dataset distributed with scikit-learn, and am testing with both statsmodels `OLS` and scikit-learn `LinearRegression`. The two models are identical: no y-intercept (not clear why this fits better, but that's what I'm seeing), same two IVs plus an interaction term, same DV. And the models give the same beta coefficients on the independent variables. The only difference I'm noticing is in the R^2 values.
```python
# statsmodels
import statsmodels.api as sm

X = bos[['RM', 'LSTAT', 'RMxLSTAT']]
y = target['MEDV']
model = sm.OLS(y, X).fit()  # no constant column added, so the fit has no intercept
predictions = model.predict(X)
model.summary()
```
This gives:
- RM 5.3906
- LSTAT 0.9631
- RMxLSTAT -0.306
- R-squared 0.957
```python
# scikit-learn
from sklearn.linear_model import LinearRegression

X = bos[['RM', 'LSTAT', 'RMxLSTAT']]
y = bos[['MEDV']]
bos_linreg = LinearRegression(fit_intercept=False)
bos_linreg.fit(X, y)
print(f'Coefficients: {bos_linreg.coef_}')
print(f'Intercept: {bos_linreg.intercept_}')
print(f'R^2 score: {bos_linreg.score(X, y)}')
```
This gives:
- Coefficients: [[ 5.39059216 0.96309233 -0.30632371]]
- Intercept: 0.0
- R^2 score: 0.7009604508111584
I've seen similar questions posted, but haven't seen an answer that applies. What am I missing?
Thanks, community.
P.S. Random: Why won't code formatting work for the second code block?
Best Answer
A reproducible example is always appreciated.
There are several R^2 definitions, and they all coincide when the model includes an intercept; without one, they diverge. statsmodels checks whether a constant column is present: with no constant, `model.summary()` reports the *uncentered* R^2, `1 - SSR / sum(y_i**2)`. scikit-learn's `score` always uses the centered total sum of squares, `1 - SSR / sum((y_i - mean(y))**2)`, regardless of `fit_intercept`. Because the uncentered total sum of squares is larger than the centered one (they differ by `n * mean(y)**2`), the uncentered R^2 (0.957) comes out higher than scikit-learn's centered one (0.700). The beta coefficients agree because both libraries solve the same least-squares problem; only the reported goodness-of-fit statistic differs.
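A short sketch makes the difference concrete. This uses synthetic data (the effect doesn't depend on the Boston dataset, and the variable names here are illustrative), fits a no-intercept model with scikit-learn, and computes both R^2 definitions by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with a nonzero-mean response
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=100)

model = LinearRegression(fit_intercept=False).fit(X, y)
resid = y - model.predict(X)
ssr = np.sum(resid ** 2)  # residual sum of squares

# scikit-learn's .score() always uses the centered total sum of squares
r2_centered = 1 - ssr / np.sum((y - y.mean()) ** 2)

# statsmodels, when the model has no constant, reports the uncentered version
r2_uncentered = 1 - ssr / np.sum(y ** 2)

print(r2_centered, model.score(X, y))  # these two agree
print(r2_uncentered)                   # larger, since mean(y) != 0
```

Plugging your own `X` and `y` in, `r2_uncentered` should reproduce the 0.957 from the statsmodels summary and `r2_centered` the 0.700 from scikit-learn.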