I am using statsmodels.api.OLS to fit a linear regression model with 4 input features. The shapes of the data are:
X_train.shape, y_train.shape
Out[]: ((350, 4), (350,))
Then I fit the model and compute the r-squared value in 3 different ways:
import numpy as np
import sklearn.metrics
import statsmodels.api as sm

ols = sm.OLS(y_train, X_train).fit()
y_pred = ols.predict(X_train)
res = y_train - y_pred

ss_tot = np.sum((y_train - y_train.mean())**2)  # centered total sum of squares
ss_res = np.sum(res**2)                         # residual sum of squares

(1 - ss_res/ss_tot), sklearn.metrics.r2_score(y_train, y_pred), ols.rsquared
Out[]: (0.91923900248372292, 0.91923900248372292, 0.99795455683297096)
The manually computed r-squared value and the value from sklearn.metrics.r2_score match exactly. However, the ols.rsquared value seems to be highly overestimated. Why is this the case? How does statsmodels compute the rsquared value?
Best Answer
This is not technically an error in statsmodels; rather, it is because statsmodels.OLS does not add the intercept/constant term to the right-hand side of the regression equation by default -- you have to add it explicitly. In contrast, sklearn (and the vast majority of other regression programs) adds the constant/intercept term by default unless it is explicitly suppressed. To add the intercept term in statsmodels, use something like:
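A minimal sketch, assuming X_train and y_train are the arrays from the question:

import statsmodels.api as sm

# sm.add_constant prepends a column of ones to X_train, so the fitted
# model includes an intercept term
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()

# With the intercept included, ols.rsquared is computed from the centered
# total sum of squares and should agree with sklearn.metrics.r2_score
print(ols.rsquared)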
The reason that omitting the intercept changes the $R^2$ is that a different definition of $R^2$ is used when there is no intercept.

We can view the usual $R^2$ as the proportional reduction in sum of squared errors between two models, A and B:
$$ \text{A:} \space Y_i = \beta_0 + \beta_1X_i + e_i $$
$$ \text{B:} \space Y_i = \beta_0 + e_i $$
In words, we compare the performance of the model that includes $X$ as a predictor vs. a model that just predicts a constant value (the sample mean) for all observations.
When the intercept $\beta_0$ is omitted from model A to form a new model -- call it model C -- it no longer makes sense to compare this to the reduced model B (B is nested in A, but it is not nested in C). So instead we adjust the computation of $R^2$ so that it can be viewed as a comparison between C and a new model D:
$$ \text{C:} \space Y_i = \beta_1X_i + e_i $$
$$ \text{D:} \space Y_i = 0 + e_i $$
In other words, we compare the slope-only model to a model that simply makes a constant prediction of 0 for all observations. This often paradoxically causes the $R^2$ to be even higher than before, but that is just because the reduced reference model D is absurd in most applications.
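As a check that this is indeed what statsmodels computes, here is a short sketch (again assuming the X_train and y_train from the question) that reproduces ols.rsquared for the no-intercept fit using the uncentered total sum of squares, i.e. measuring against model D's constant prediction of 0:

import numpy as np
import statsmodels.api as sm

# Fit without an intercept, exactly as in the question
ols = sm.OLS(y_train, X_train).fit()
y_pred = ols.predict(X_train)

ss_res = np.sum((y_train - y_pred)**2)
ss_tot_uncentered = np.sum(y_train**2)  # reference model D predicts 0 everywhere

# The two values should match (0.9979... in the question's output)
print(1 - ss_res/ss_tot_uncentered, ols.rsquared)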
This and related issues are discussed a bit further in the following threads:
Removal of statistically significant intercept term increases $R^2$ in linear model
When forcing intercept of 0 in linear regression is acceptable/advisable
When is it ok to remove the intercept in a linear regression model?