Solved – How Residuals of Instrumental Variables Estimation are calculated and why you can have a negative R-squared

Tags: 2sls, r-squared, regression, residuals, stata

I would like to understand precisely why a 2SLS estimation can produce a negative $R^2$, as happens with commands like ivreg2 in Stata. Stata documents this behaviour for ivregress here: http://tinyurl.com/qbb9o9s. It states that the residuals are calculated with the endogenous variables of the structural model, but it also states that the estimation does not have a nested constant-only version of the model; I guess this last detail is what makes a negative $R^2$ possible. I know that an estimation without a constant can produce a negative R-squared (I understand this), but I cannot say for sure whether not having a nested constant-only version of the model is in any way similar. Could someone help me understand how the residuals are calculated?

Best Answer

First of all, ask yourself whether your instruments are actually strong enough to warrant the use of 2SLS. As you perhaps know from Bound et al. (1995), 2SLS estimates can be badly biased and inconsistent when the instruments are weak. Moreover, you should run an F test on the instruments in the first stage and check whether the statistic is above ten.
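As a rough illustration of the first-stage check, here is a minimal sketch in base R on simulated data. The variable names and parameter values are assumptions made for the example, and the threshold of ten is only the usual rule of thumb.

```r
# Simulated data; all names and parameter values are illustrative assumptions
set.seed(1)
n <- 1000
z <- rnorm(n)                 # instrument
u <- rnorm(n)                 # unobserved confounder
x <- 0.5 * z + u + rnorm(n)   # endogenous regressor

# First stage: regress the endogenous regressor on the instrument
first_stage <- lm(x ~ z)

# Overall F statistic of the first stage. With a single instrument and no
# other controls this is the F test on the instrument; with extra exogenous
# controls you would instead test only the excluded instruments.
f_stat <- unname(summary(first_stage)$fstatistic["value"])
f_stat
f_stat > 10   # rule-of-thumb check
```

With one instrument and homoskedastic errors, this F statistic equals the squared t statistic on `z` in the first stage.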

Even better, use test statistics that are robust to weak instruments. ivreg2 and condivreg have some available, but only for one endogenous regressor under conditional homoskedasticity. The $R^2$ value is usually useless for inference. Check whether your coefficients are statistically significant, first using a t-test and then using Anderson-Rubin confidence intervals as given by condivreg.

These intervals may be infinitely large, in which case their width reflects how weak your instrument is.
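To give an idea of what an Anderson-Rubin interval is, here is a hedged sketch in base R that inverts the test over a grid of candidate values. It is not what condivreg does internally; the simulated data, the grid bounds and the helper name `ar_pvalue` are all assumptions made for the example, and it covers only one endogenous regressor with one instrument under homoskedasticity.

```r
# Simulated data; all names and parameter values are illustrative assumptions
set.seed(7)
n <- 500
z <- rnorm(n)                     # instrument
u <- rnorm(n)                     # unobserved confounder
x <- z + u + rnorm(n)             # endogenous regressor
y <- 1 + 0.5 * x + u + rnorm(n)   # structural equation, true beta = 0.5

# IV point estimate, for comparison with the interval
b_iv <- cov(z, y) / cov(z, x)

# AR test of H0: beta = b0. Under H0, y - b0 * x is unrelated to z,
# so regress it on z and test whether the coefficient on z is zero.
ar_pvalue <- function(b0) {
  w <- y - b0 * x
  summary(lm(w ~ z))$coefficients["z", "Pr(>|t|)"]
}

grid <- seq(-2, 3, by = 0.01)            # hypothetical search range
pvals <- vapply(grid, ar_pvalue, numeric(1))
accepted <- grid[pvals > 0.05]           # values of beta not rejected at 5%
range(accepted)                          # approximate 95% AR interval
```

With a weak instrument the set of accepted values can run to the edge of any grid you choose, which is the finite-grid symptom of an unbounded interval.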

Related Solutions

Solved – Why report r-squared in Instrumental Variables Estimation

It's true that $R^2$ in instrumental variables regressions is not useful. Since one of the explanatory variables $x$ is correlated with the error $\epsilon$, we can't decompose the variance of the outcome $y$ into $\beta^2 Var(x) + Var(\epsilon)$, so the obtained $R^2$ neither has a natural interpretation nor can be used to compute F-tests of joint significance. Also, $R^2$ in instrumental variables regression can be negative, and on this point it makes no difference whether you use $$R^2 = \frac{MSS}{TSS} \quad \text{or} \quad R^2 = 1- \frac{RSS}{TSS}$$ because when $RSS>TSS$, we also have $MSS = TSS - RSS < 0$. In general the two expressions are the same, so there is no reason why one should be more popular than the other. The issue is discussed at more length in the resources and support FAQs on the Stata website.
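A short derivation may help to see when the negative value arises. This is a sketch that assumes the single-regressor structural model $y = a + \beta x + \epsilon$ with $Cov(x,\epsilon) \neq 0$, and that the 2SLS estimates are consistent, so that in large samples the residuals computed from the actual $x$ behave like $\epsilon$:
$$Var(y) = \beta^2 Var(x) + Var(\epsilon) + 2\beta\, Cov(x,\epsilon)$$
$$1 - \frac{RSS}{TSS} \approx 1 - \frac{Var(\epsilon)}{Var(y)} = \frac{\beta^2 Var(x) + 2\beta\, Cov(x,\epsilon)}{Var(y)}$$
Under these assumptions the reported $R^2$ is negative whenever $2\beta\, Cov(x,\epsilon) < -\beta^2 Var(x)$, that is, when the endogeneity works strongly enough against the sign of $\beta$.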

[edit] to address the additional question in the comment
When you instrument the endogenous variable $x$ with your instrument $z$ as $$x = \alpha + \pi z + \eta$$ you use the predicted values $\widehat{x}$ in the second stage $$y = a + \beta \widehat{x} + \epsilon$$ and if you do this procedure by hand in Stata like

reg x z
predict x_hat, xb
reg y x_hat

the residuals will be calculated as $y - \widehat{x}\beta$, and the standard errors built from them will be wrong. They are wrong because the second-stage regression treats $\widehat{x}$ as if it were observed data, when it is in fact an estimated quantity. These residuals do have the property that $RSS \leq TSS$, so there can be no negative $R^2$: $\widehat{x}\beta$ is never a worse in-sample predictor of $y$ than $\overline{y}$.

To calculate the correct standard errors you use the actual values of the endogenous variable $x$, not its fitted values, when computing the residuals $e = y - x\beta$. The issue is that the $RSS$ is then computed from a different set of regressors than the one actually used to fit the model, so least squares no longer guarantees $RSS \leq TSS$. For this reason it can happen that $x\beta$ is a worse predictor of $y$ than $\overline{y}$.
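The contrast between the two sets of residuals can be sketched with a small simulation in base R. The data generating process below is an assumption chosen so that the confounder works against the sign of $\beta$; the coefficients are those of the by-hand second stage, and only the residuals differ.

```r
# Simulated data; all names and parameter values are illustrative assumptions
set.seed(42)
n <- 5000
z <- rnorm(n)                         # instrument
u <- rnorm(n)                         # unobserved confounder
x <- z + u                            # endogenous regressor
y <- 1 + 0.5 * x - 2 * u + rnorm(n)   # structural equation, true beta = 0.5

# 2SLS by hand: first stage, then regress y on the fitted values
x_hat  <- fitted(lm(x ~ z))
stage2 <- lm(y ~ x_hat)
a_hat  <- unname(coef(stage2)[1])
b_hat  <- unname(coef(stage2)[2])

tss <- sum((y - mean(y))^2)

# Residuals built from the fitted x_hat: what the by-hand second stage reports
rss_naive <- sum((y - a_hat - b_hat * x_hat)^2)
# Residuals built from the actual x: the structural residuals
rss_iv <- sum((y - a_hat - b_hat * x)^2)

r2_naive <- 1 - rss_naive / tss
r2_iv    <- 1 - rss_iv / tss
c(b_hat = b_hat, r2_naive = r2_naive, r2_iv = r2_iv)
```

In this setup the first $R^2$ stays between 0 and 1 by construction, while the second comes out negative even though the slope estimate is close to the true value.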

Solved – 2SLS with endogenous interaction term

High level, if you have a valid instrument for endog, then the interaction term between your instrument for endog and link will be a valid instrument for the interaction. So no problems there.

I lack the intuition for dealing with the second part of the problem. So why not simulate? In R:

# Data generating process
n <- 100000
# Create instruments
z1 <- rnorm(n); z2 <- rnorm(n)

# Create unobserved variables
naughty1 <- rnorm(n); naughty2 <- rnorm(n)

# Create endogenous regressors
x1 <- z1 + naughty1 + rnorm(n)
x2 <- z2 + naughty2 + rnorm(n)

# Create outcome variable 
y <- 1 + 0.5*x1 + 0.2*x2 + 0.7*naughty1 -0.5*naughty2 + 2*x1*x2  + rnorm(n)

# Verify that we get biased estimates
mod <- lm(y ~ x1 + x2 + x1:x2)
summary(mod)
# We get fairly strong bias for the coefficients on x1 and x2

# Now we use your method
x1hat <- predict(lm(x1 ~ z1))
x2hat <- predict(lm(x2 ~ z2))

mod2 <- lm(y ~ x1hat + x2hat + x1hat:x2hat)
summary(mod2)
# Hey presto - it works; no more bias.

Naturally you'd need to bootstrap the confidence intervals to correct for the two stages. But what you've proposed should work. Note that it's not particularly efficient. Even in this case, I had to dial the number of observations up pretty high before it started zeroing in on the true parameters.
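One way the bootstrap could look is sketched below in base R: a pairs bootstrap that resamples rows and reruns both stages on each resample. The smaller sample size, the number of replications `B`, the helper name `two_stage` and the choice of percentile intervals are assumptions made to keep the example quick, not part of the original proposal.

```r
# Same data generating process as above, with a smaller n to keep it fast
set.seed(123)
n <- 2000
z1 <- rnorm(n); z2 <- rnorm(n)
naughty1 <- rnorm(n); naughty2 <- rnorm(n)
x1 <- z1 + naughty1 + rnorm(n)
x2 <- z2 + naughty2 + rnorm(n)
y  <- 1 + 0.5*x1 + 0.2*x2 + 0.7*naughty1 - 0.5*naughty2 + 2*x1*x2 + rnorm(n)
dat <- data.frame(y, x1, x2, z1, z2)

# Run both stages on a data frame and return the second-stage coefficients
two_stage <- function(d) {
  d$x1hat <- fitted(lm(x1 ~ z1, data = d))
  d$x2hat <- fitted(lm(x2 ~ z2, data = d))
  coef(lm(y ~ x1hat + x2hat + x1hat:x2hat, data = d))
}

# Pairs bootstrap: resample rows with replacement, redo both stages each time
B <- 200
boot <- replicate(B, two_stage(dat[sample(n, replace = TRUE), ]))

# Percentile 95% intervals, one row per coefficient
ci <- t(apply(boot, 1, quantile, probs = c(0.025, 0.975)))
ci
```

Because the first stage is re-estimated inside every replication, the intervals reflect the sampling variability of both stages.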
