Solved – How does IV 2SLS obtain a causal coefficient

2sls, econometrics, endogeneity, instrumental-variables

Despite reading about IV 2SLS and working through several practical examples, I am still uncertain how, specifically and mathematically, 2SLS obtains a causal coefficient, β, of an assumed endogenous variable, X, on an outcome, Y.

From what I gather, 2SLS follows this logic:

  1. First stage: We regress the endogenous variable, X, on all exogenous variables, including the instrument(s). We then store the predicted values of X.

  2. Second stage: In the second stage regression, the predicted values of X replace the endogenous variable; consequently, β now represents an "isolated" causal coefficient for X on Y (see the sketch below).
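Here is a minimal numerical sketch of the two-step recipe just described, run by hand on simulated data. All variable names, the confounder u, and the coefficient values (true β = 2) are made up for illustration.

```python
# Manual 2SLS on simulated data: first stage, fitted values, second stage.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 2.0                      # true causal effect of X on Y (assumed)

w = rng.normal(size=n)          # exogenous covariate
u = rng.normal(size=n)          # unobserved confounder
z = rng.normal(size=n)          # instrument: relevant, excluded from Y
x = 0.8 * z + 0.5 * w + u + rng.normal(size=n)       # endogenous regressor
y = 1.0 + beta * x + 0.3 * w - 1.5 * u + rng.normal(size=n)

# First stage: regress X on all exogenous variables, including the instrument.
exog_fs = np.column_stack([np.ones(n), z, w])
gamma, *_ = np.linalg.lstsq(exog_fs, x, rcond=None)
x_hat = exog_fs @ gamma                               # predicted values of X

# Second stage: replace X with its predicted values.
exog_ss = np.column_stack([np.ones(n), x_hat, w])
coefs, *_ = np.linalg.lstsq(exog_ss, y, rcond=None)
print("2SLS estimate of beta:", coefs[1])             # close to 2.0

# Naive OLS on the original X is biased by the confounder u.
coefs_ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x, w]), y, rcond=None)
print("OLS estimate of beta: ", coefs_ols[1])         # pulled away from 2.0
```

Dedicated IV routines do both stages in a single call; the manual version above reproduces the point estimate but, unlike packaged 2SLS, its second-stage standard errors are not correct.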

At a conceptual level, I understand that the first stage somehow removes the correlation between X and the error term, ϵ, so that when we replace X with its predicted value in the second stage we obtain a causal coefficient, β, for the effect of X on Y. However, I am unsure about the mathematics behind this "isolation" of the causal effect. The main question is therefore: what is the specific mathematical operation that makes β a causal coefficient for the effect of X on Y in the second-stage regression?

Another post (What is an instrumental variable?) vividly describes how 2SLS separates the explained from the unexplained variation of an endogenous variable through the two-stage procedure. However, that example is based on a first stage that regresses the endogenous variable on the instrument alone, after which the predicted value of X is plugged into the second-stage regression. While illustrative, I am unsure how this translates to a more conventional setting where 2SLS is used with an endogenous variable, multiple exogenous explanatory variables, and one instrument.

Best Answer

Start from the structural model, $$y_i = \alpha + \beta X_i + \epsilon_i$$ where the explanatory variable of interest $X_i$ is correlated with the error term, $Cov(X_i,\epsilon_i)\neq 0$. In this case, OLS will not recover $\beta$: the estimate $\widehat{\beta}$ is biased and does not converge to $\beta$. Now assume that you have an instrument $Z_i$ such that $Cov(X_i,Z_i)\neq0$ and $Cov(Z_i,\epsilon_i)=0$. These are the assumptions of instrument relevance and the exclusion restriction, respectively.
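To make these conditions concrete, here is a small simulation sketch in which $X_i$ is made endogenous through an unobserved confounder u, while $Z_i$ is constructed to satisfy relevance and the exclusion restriction. The coefficient values and names are assumptions chosen purely for illustration.

```python
# Check the three covariance conditions on simulated data.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # candidate instrument
x = 0.8 * z + u + rng.normal(size=n)         # endogenous regressor
eps = -1.5 * u + rng.normal(size=n)          # structural error, correlated with X
y = 1.0 + 2.0 * x + eps                      # true beta = 2.0

print("Cov(X, eps):", np.cov(x, eps)[0, 1])  # clearly nonzero -> endogeneity
print("Cov(X, Z):  ", np.cov(x, z)[0, 1])    # nonzero -> relevance
print("Cov(Z, eps):", np.cov(z, eps)[0, 1])  # approximately 0 -> exclusion holds here

# The OLS slope of y on x does not recover beta = 2.0.
b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print("OLS slope:", b_ols)
```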

Take the covariance of each side of the above equation with $Z_i$, and you get $$ \begin{align} Cov(y_i,Z_i) &= \beta Cov(X_i,Z_i) + Cov(\epsilon_i, Z_i)\\[0.5em] \beta &= \frac{Cov(y_i,Z_i)}{Cov(X_i,Z_i)} \end{align} $$ which uses the fact that the covariance between a random variable and a constant is zero, as well as the exclusion restriction $Cov(Z_i,\epsilon_i)=0$. In fact, we just derived the expression of the IV estimator. The population coefficient $\beta$ can be recovered by dividing the "reduced form" (regression of $y_i$ on $Z_i$) coefficient by the first stage (regression of $X_i$ on $Z_i$) coefficient.
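The sample analogue of this ratio can be checked directly. In the sketch below (same kind of toy design as above, with an assumed true β = 2), the covariance ratio lands near β while the OLS slope does not.

```python
# IV estimator as a ratio of covariances: beta_IV = Cov(Y, Z) / Cov(X, Z).
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument
x = 0.8 * z + u + rng.normal(size=n)          # endogenous regressor
y = 1.0 + 2.0 * x + (-1.5 * u + rng.normal(size=n))   # true beta = 2.0

beta_iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]
beta_ols = np.cov(y, x)[0, 1] / np.var(x, ddof=1)

print("IV estimate: ", beta_iv)    # close to 2.0
print("OLS estimate:", beta_ols)   # noticeably biased here
```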

How does this relate to my answer in the other post? The denominator of the above fraction comes from the first stage, i.e. from regressing $X_i$ on $Z_i$: $$ X_i = \delta + \pi Z_i + \eta_i $$

Now you see that we have an expression for $X_i$ as a linear function of the instrument. If you plug this into the very first equation, you get the so-called reduced-form equation: $$ \begin{align} y_i &= \alpha + \beta X_i + \epsilon_i \\ &= \alpha + \beta (\delta + \pi Z_i + \eta_i) + \epsilon_i \\ &= (\alpha + \beta\delta) + \beta \pi Z_i + (\beta\eta_i + \epsilon_i) \end{align} $$

So the ratio of the reduced-form coefficient on $Z_i$ to the first-stage coefficient is indeed $$ \begin{align} \frac{Cov(y_i,Z_i)}{Cov(X_i,Z_i)} &= \frac{\beta\pi \, Var(Z_i)}{\pi \, Var(Z_i)}\\ &= \frac{\beta\pi}{\pi}\\ &= \beta \end{align} $$ the causal effect of interest. This is the maths behind it. I hope that this, together with the other answer, gives you a better intuition for how an instrumental variable can be used to extract "exogenous" variation (under the stated assumptions) from the original $X_i$ to identify the parameter of interest.
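As a final sketch, this "reduced form over first stage" reading can be reproduced with two simple regressions and a ratio of slopes. The data-generating values here (β = 2, π = 0.8) are again assumptions for the example.

```python
# Indirect least squares / Wald-style ratio: slope(y on z) / slope(x on z).
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta, pi = 2.0, 0.8                            # assumed structural and first-stage slopes

u = rng.normal(size=n)                         # unobserved confounder
z = rng.normal(size=n)                         # instrument
x = 0.5 + pi * z + u + rng.normal(size=n)      # first stage: X = delta + pi*Z + eta
y = 1.0 + beta * x + (-1.5 * u + rng.normal(size=n))

def slope(dep, reg):
    """OLS slope from a simple regression of dep on reg (with intercept)."""
    return np.cov(dep, reg)[0, 1] / np.var(reg, ddof=1)

reduced_form = slope(y, z)                     # approximately beta * pi = 1.6
first_stage = slope(x, z)                      # approximately pi = 0.8

print("Reduced-form slope:", reduced_form)
print("First-stage slope: ", first_stage)
print("Ratio:             ", reduced_form / first_stage)   # close to beta = 2.0
```

With additional exogenous covariates the same logic carries through once those covariates are included in both regressions, which is effectively what the first and second stages of 2SLS do in the conventional setup described in the question.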