Statsmodels – Why Does statsmodels.api.OLS Over-Report the R-Squared Value?

Tags: multiple-regression, r-squared, scikit-learn, statsmodels

I am using statsmodels.api.OLS to fit a linear regression model with four input features.

The shape of the data is:

X_train.shape, y_train.shape  

Out[]: ((350, 4), (350,))

Then I fit the model and compute the r-squared value in three different ways:

import numpy as np
import statsmodels.api as sm
import sklearn.metrics

ols = sm.OLS(y_train, X_train).fit()

y_pred = ols.predict(X_train)
res = y_train - y_pred

ss_tot = np.sum( (y_train - y_train.mean())**2 )
ss_res = np.sum( (y_train - y_pred)**2 )

(1 - ss_res/ss_tot), sklearn.metrics.r2_score(y_train, y_pred), ols.rsquared

Out[]: (0.91923900248372292, 0.91923900248372292, 0.99795455683297096)

The manually computed r-squared value and the value from sklearn.metrics.r2_score match exactly.
However, the ols.rsquared value appears to be greatly overestimated.

Why is this the case? How does statsmodels compute the rsquared value?

Best Answer

This is not technically an error in statsmodels; rather, it is because statsmodels.OLS does not add the intercept/constant term to the right-hand side of the regression equation by default -- you have to add it explicitly. In contrast, sklearn (and the vast majority of other regression programs) adds the constant/intercept term by default unless it is explicitly suppressed.

To add the intercept term to statsmodels, use something like:

ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()

The reason that omitting the intercept changes the $R^2$ is that a different definition of $R^2$ is used when there is no intercept.

We can view the usual $R^2$ as the proportional reduction in sum of squared errors between two models, A and B. $$ \text{A:} \space Y_i = \beta_0 + \beta_1X_i + e_i $$ $$ \text{B:} \space Y_i = \beta_0 + e_i $$ In words, we compare the performance of the model that includes $X$ as a predictor vs. a model that just predicts a constant value (the sample mean) for all observations.

When the intercept $\beta_0$ is omitted from model A to form a new model -- call it model C -- it no longer makes sense to compare this to the reduced model B (B is nested in A but it is not nested in C). So instead we adjust the computation of $R^2$ so that it can be viewed as the comparison between C and a new model D $$ \text{C:} \space Y_i = \beta_1X_i + e_i $$ $$ \text{D:} \space Y_i = 0 + e_i $$ In other words, we compare the slope-only model to a model that simply makes a constant prediction of 0 for all observations. This often paradoxically causes the $R^2$ to be even higher than before, but it's just because the reduced reference model D is absurd in most applications.

This and related issues are discussed a bit further in the following threads:

Removal of statistically significant intercept term increases $R^2$ in linear model

When forcing intercept of 0 in linear regression is acceptable/advisable

When is it ok to remove the intercept in a linear regression model?