I am running an instrumental-variable regression with the `ivreg` function in R.
All of my diagnostic tests related to endogeneity are satisfactory, except that the R-squared value is negative.
Can I simply ignore this negative R-squared and leave it unreported?
If not, how can I resolve the issue? The code and output are below:
> Y_ivreg <- ivreg(Y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 | x2 + x8 + x9 + x10 + x5 + x6 + x7, data = DATA)
> summary(Y_ivreg,diagnostics=TRUE)
Call:
ivreg(formula = Y ~ x1 + x2 + x3 + x4 + x5 +
x6 + x7 | x2 + x8 + x9 + x10 +
x5 + x6 + x7, data = DATA)
Residuals:
Min 1Q Median 3Q Max
-0.747485 -0.053721 -0.009349 0.044285 1.085256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0979178 0.0319244 3.067 0.00218 **
x1 0.0008438 0.0004927 1.712 0.08691 .
x2 0.0018515 0.0009135 2.027 0.04277 *
x3 -0.0130133 0.0073484 -1.771 0.07667 .
x4 -0.0018486 0.0009552 -1.935 0.05303 .
x5 -0.0000294 0.0000126 -2.333 0.01971 *
x6 0.0018214 0.0008908 2.045 0.04096 *
x7 -0.0024457 0.0005488 -4.456 8.61e-06 ***
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments (x1) 3 3313 185.440 <2e-16 ***
Weak instruments (x3) 3 3313 3861.526 <2e-16 ***
Weak instruments (x4) 3 3313 3126.315 <2e-16 ***
Wu-Hausman 3 3310 1.943 0.121
Sargan 0 NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1142 on 3313 degrees of freedom
Multiple R-Squared: -0.009029, Adjusted R-squared: -0.01116
Wald test: 4.231 on 7 and 3313 DF, p-value: 0.0001168
For reference, here is a Stata FAQ post related to this issue and IV regression:
https://www.stata.com/support/faqs/statistics/two-stage-least-squares/
Best Answer
Yes, the linked Stata post answers your question in a single sentence. Here is the longer explanation:
How can $R^2$ be negative?
Wikipedia has a great visualization of $R^2$:
On the left, we see the $\color{red}{\text{total sum of squares}}$, obtained by using the mean ($\bar{y}$) as a prediction:
$${\text{total sum of squares}} = \sum_{i = 1}^n (y_i - \bar{y})^2$$
On the right, we see the $\color{blue}{\text{residual sum of squares}}$, obtained by using the model's predictions ($\hat{y}$):
$${\text{residual sum of squares}} = \sum_{i = 1}^n (y_i - \hat{y}_i)^2 = \sum_{i = 1}^n \bigg(y_i - \Big( \hat{\beta}_0 + \sum_{j = 1}^p \hat{\beta}_j \, x_{ij} \Big) \bigg)^2$$
Ordinarily, $R^2 = 1 - \frac{\color{blue}{\text{residual sum of squares}}}{\color{red}{\text{total sum of squares}}} \geq 0$, because any model with an intercept ($\beta_0$) performs at least as well as the picture on the left: setting $\hat{\beta}_0 = \bar{y}$ and all other coefficients to zero reproduces the mean prediction exactly, and least squares can only improve on it.
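You can verify this numerically. Here is a small sketch (in Python/NumPy rather than R, purely illustrative): even when the regressor is pure noise with no relation to the response, an OLS fit with an intercept never does worse than the mean, so $R^2$ stays nonnegative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
y = rng.normal(size=n)
x = rng.normal(size=n)  # pure noise, unrelated to y

# OLS with an intercept: the mean-only model is a special case of this
# design, so the fitted RSS can never exceed the TSS.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss
print(round(r2, 4))  # small, but nonnegative
```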
However, if you interpret instrumental variable regression as a two-stage linear regression, it is easy to show why it could end up being negative. Namely, suppose the endogenous variables ($\mathbf{X}$) are regressed on the exogenous variables ($\mathbf{Z}$), and the predicted values ($\hat{\mathbf{X}}$) are then used as covariates in the second stage:
$$\text{Stage 1:} \quad \mathbf{X} = \mathbf{Z}\boldsymbol{\delta} + \text{error}$$
$$\text{Stage 2:} \quad \mathbf{y} = \hat{\mathbf{X}}\boldsymbol{\beta} + \text{error}$$
Since $\hat{\mathbf{X}} \neq \mathbf{X}$, the error minimized in the second stage, $\mathbf{y} - \hat{\mathbf{X}}\boldsymbol{\beta}$, is not the error used to calculate the residual sum of squares, which is formed from the original covariates as $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$. Consequently, the residual sum of squares need not be smaller than the total sum of squares anymore. (And more importantly, the $R^2$ has become meaningless, so there is nothing to resolve: you can safely omit it when reporting.)
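This mechanism is easy to reproduce by hand. Below is an illustrative simulation (Python/NumPy rather than R; the data-generating process and variable names are my own invention): we run the two stages manually and then compute $R^2$ from residuals formed with the original regressor $\mathbf{X}$, not $\hat{\mathbf{X}}$. Because the IV coefficients do not minimize that residual sum of squares, the IV $R^2$ can never exceed the OLS $R^2$, and with weak explanatory power it can fall below zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=n)              # instrument
u = rng.normal(size=n)              # unobserved confounder
x = z + u + rng.normal(size=n)      # endogenous regressor
y = x + 2 * u + rng.normal(size=n)  # true slope = 1; OLS is biased upward

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

ones = np.ones(n)
# Stage 1: project the endogenous regressor on the instrument.
Z = np.column_stack([ones, z])
xhat = Z @ ols(Z, x)
# Stage 2: regress y on the projected regressor.
beta_iv = ols(np.column_stack([ones, xhat]), y)

# R^2 is computed from residuals using the ORIGINAL x, not xhat:
X = np.column_stack([ones, x])
tss = np.sum((y - y.mean()) ** 2)
rss_iv = np.sum((y - X @ beta_iv) ** 2)
r2_iv = 1 - rss_iv / tss

# OLS for comparison: by construction it minimizes this same RSS.
beta_ols = ols(X, y)
r2_ols = 1 - np.sum((y - X @ beta_ols) ** 2) / tss

print(f"slope: OLS {beta_ols[1]:.2f} (biased), IV {beta_iv[1]:.2f}")
print(f"R^2:   OLS {r2_ols:.3f}, IV {r2_iv:.3f}")  # IV R^2 <= OLS R^2
```

In this simulation the regressor still explains a lot of the variance, so the IV $R^2$ remains positive; with a weaker regressor and a larger endogeneity correction it drops below zero, exactly as in your output. Either way, the IV coefficient, not the $R^2$, is the quantity of interest.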