I am using statsmodels.api.OLS to fit a linear regression model with 4 input features. The shapes of the data are:
X_train.shape, y_train.shape
Out[]: ((350, 4), (350,))
Then I fit the model and compute the r-squared value in 3 different ways:
import numpy as np
import sklearn.metrics
import statsmodels.api as sm

ols = sm.OLS(y_train, X_train).fit()
y_pred = ols.predict(X_train)
res = y_train - y_pred

ss_tot = np.sum((y_train - y_train.mean())**2)  # centered total sum of squares
ss_res = np.sum(res**2)                         # residual sum of squares

(1 - ss_res/ss_tot), sklearn.metrics.r2_score(y_train, y_pred), ols.rsquared
Out[]: (0.91923900248372292, 0.91923900248372292, 0.99795455683297096)
The manually computed r-squared value and the value from sklearn.metrics.r2_score match exactly. However, the ols.rsquared value seems to be highly overestimated. Why is this the case? How does statsmodels compute the rsquared value?
Best Answer
This is not technically an error in statsmodels; rather, it is because statsmodels.OLS does not add the intercept/constant term to the right-hand side of the regression equation by default -- you have to add it explicitly. In contrast, sklearn (and the vast majority of other regression programs) adds the constant/intercept term by default unless it is explicitly suppressed. To add the intercept term in statsmodels, use something like:
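A minimal sketch, assuming X_train and y_train are the arrays from the question:

import statsmodels.api as sm

# sm.add_constant prepends a column of ones to X_train, so the fitted
# model includes an intercept term
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()

# With the intercept included, ols.rsquared is computed from the centered
# total sum of squares and should agree with sklearn.metrics.r2_score
print(ols.rsquared)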
The reason that omitting the intercept changes the $R^2$ is that a different definition of $R^2$ is used when there is no intercept.

We can view the usual $R^2$ as the proportional reduction in sum of squared errors between two models, A and B:
$$ \text{A:} \space Y_i = \beta_0 + \beta_1X_i + e_i $$
$$ \text{B:} \space Y_i = \beta_0 + e_i $$
In words, we compare the performance of the model that includes $X$ as a predictor vs. a model that just predicts a constant value (the sample mean) for all observations.
When the intercept $\beta_0$ is omitted from model A to form a new model -- call it model C -- it no longer makes sense to compare this to the reduced model B (B is nested in A, but it is not nested in C). So instead we adjust the computation of $R^2$ so that it can be viewed as a comparison between C and a new model D:
$$ \text{C:} \space Y_i = \beta_1X_i + e_i $$
$$ \text{D:} \space Y_i = 0 + e_i $$
In other words, we compare the slope-only model to a model that simply makes a constant prediction of 0 for all observations. This often paradoxically causes the $R^2$ to be even higher than before, but that is just because the reduced reference model D is absurd in most applications.
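As a check that this is indeed what statsmodels computes, here is a short sketch (again assuming the X_train and y_train from the question) that reproduces ols.rsquared for the no-intercept fit using the uncentered total sum of squares, i.e. measuring against model D's constant prediction of 0:

import numpy as np
import statsmodels.api as sm

# Fit without an intercept, exactly as in the question
ols = sm.OLS(y_train, X_train).fit()
y_pred = ols.predict(X_train)

ss_res = np.sum((y_train - y_pred)**2)
ss_tot_uncentered = np.sum(y_train**2)  # reference model D predicts 0 everywhere

# The two values should match (0.9979... in the question's output)
print(1 - ss_res/ss_tot_uncentered, ols.rsquared)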
This and related issues are discussed a bit further in the following threads:
Removal of statistically significant intercept term increases $R^2$ in linear model
When forcing intercept of 0 in linear regression is acceptable/advisable
When is it ok to remove the intercept in a linear regression model?