Solved – How does IV 2SLS obtain a causal coefficient

2sls, econometrics, endogeneity, instrumental-variables

Despite reading about IV 2SLS and working through several practical examples, I am still uncertain how, specifically and mathematically, 2SLS obtains a causal coefficient, β, of an assumed endogenous variable, X, on an outcome, Y.

From what I gather, 2SLS follows this logic:

  1. First stage: We regress the endogenous variable, X, on all exogenous variables, including the instrument(s). We then store the predicted values of X.

  2. Second stage: In the second stage regression, the predicted values of X replace the endogenous variable; consequently, β now represents an "isolated" causal coefficient for X on Y (see the sketch below).
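Here is a minimal numerical sketch of the two-step recipe just described, run by hand on simulated data. All variable names, the confounder u, and the coefficient values (true β = 2) are made up for illustration.

```python
# Manual 2SLS on simulated data: first stage, fitted values, second stage.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 2.0                      # true causal effect of X on Y (assumed)

w = rng.normal(size=n)          # exogenous covariate
u = rng.normal(size=n)          # unobserved confounder
z = rng.normal(size=n)          # instrument: relevant, excluded from Y
x = 0.8 * z + 0.5 * w + u + rng.normal(size=n)       # endogenous regressor
y = 1.0 + beta * x + 0.3 * w - 1.5 * u + rng.normal(size=n)

# First stage: regress X on all exogenous variables, including the instrument.
exog_fs = np.column_stack([np.ones(n), z, w])
gamma, *_ = np.linalg.lstsq(exog_fs, x, rcond=None)
x_hat = exog_fs @ gamma                               # predicted values of X

# Second stage: replace X with its predicted values.
exog_ss = np.column_stack([np.ones(n), x_hat, w])
coefs, *_ = np.linalg.lstsq(exog_ss, y, rcond=None)
print("2SLS estimate of beta:", coefs[1])             # close to 2.0

# Naive OLS on the original X is biased by the confounder u.
coefs_ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x, w]), y, rcond=None)
print("OLS estimate of beta: ", coefs_ols[1])         # pulled away from 2.0
```

Dedicated IV routines do both stages in a single call; the manual version above reproduces the point estimate but, unlike packaged 2SLS, its second-stage standard errors are not correct.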

At a conceptual level, I understand that the first stage somehow removes the correlation between X and the error term, ϵ, so that when we replace X with its predicted value in the second stage we obtain a causal coefficient, β, for the effect of X on Y. However, I am unsure about the mathematics behind this "isolation" of the causal effect. The main question is therefore: what is the specific mathematical operation that makes β a causal coefficient for the effect of X on Y in the second-stage regression?

Another post (What is an instrumental variable?) vividly describes how 2SLS separates the explained from the unexplained variation of an endogenous variable through the two-stage procedure. However, that example is based on a first stage that regresses the endogenous variable on the instrument alone, after which the predicted value of X is plugged into the second-stage regression. While illustrative, I am unsure how this translates to a more conventional setting where 2SLS is used with an endogenous variable, multiple exogenous explanatory variables, and one instrument.

Best Answer

Start from the structural model, $$y_i = \alpha + \beta X_i + \epsilon_i$$ where the explanatory variable of interest $X_i$ is correlated with the error term, $Cov(X_i,\epsilon_i)\neq 0$. In this case, OLS will not recover $\beta$: the estimate $\widehat{\beta}$ is biased and does not converge to $\beta$. Now assume that you have an instrument $Z_i$ such that $Cov(X_i,Z_i)\neq0$ and $Cov(Z_i,\epsilon_i)=0$. These are the assumptions of instrument relevance and the exclusion restriction, respectively.
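To make these conditions concrete, here is a small simulation sketch in which $X_i$ is made endogenous through an unobserved confounder u, while $Z_i$ is constructed to satisfy relevance and the exclusion restriction. The coefficient values and names are assumptions chosen purely for illustration.

```python
# Check the three covariance conditions on simulated data.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # candidate instrument
x = 0.8 * z + u + rng.normal(size=n)         # endogenous regressor
eps = -1.5 * u + rng.normal(size=n)          # structural error, correlated with X
y = 1.0 + 2.0 * x + eps                      # true beta = 2.0

print("Cov(X, eps):", np.cov(x, eps)[0, 1])  # clearly nonzero -> endogeneity
print("Cov(X, Z):  ", np.cov(x, z)[0, 1])    # nonzero -> relevance
print("Cov(Z, eps):", np.cov(z, eps)[0, 1])  # approximately 0 -> exclusion holds here

# The OLS slope of y on x does not recover beta = 2.0.
b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print("OLS slope:", b_ols)
```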

Take the covariance of each side of the above equation with $Z_i$, and you get $$ \begin{align} Cov(y_i,Z_i) &= \beta Cov(X_i,Z_i) + Cov(\epsilon_i, Z_i)\\[0.5em] \beta &= \frac{Cov(y_i,Z_i)}{Cov(X_i,Z_i)} \end{align} $$ which uses the fact that the covariance between a random variable and a constant is zero, as well as the exclusion restriction $Cov(Z_i,\epsilon_i)=0$. In fact, we just derived the expression of the IV estimator. The population coefficient $\beta$ can be recovered by dividing the "reduced form" (regression of $y_i$ on $Z_i$) coefficient by the first stage (regression of $X_i$ on $Z_i$) coefficient.
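The sample analogue of this ratio can be checked directly. In the sketch below (same kind of toy design as above, with an assumed true β = 2), the covariance ratio lands near β while the OLS slope does not.

```python
# IV estimator as a ratio of covariances: beta_IV = Cov(Y, Z) / Cov(X, Z).
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument
x = 0.8 * z + u + rng.normal(size=n)          # endogenous regressor
y = 1.0 + 2.0 * x + (-1.5 * u + rng.normal(size=n))   # true beta = 2.0

beta_iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]
beta_ols = np.cov(y, x)[0, 1] / np.var(x, ddof=1)

print("IV estimate: ", beta_iv)    # close to 2.0
print("OLS estimate:", beta_ols)   # noticeably biased here
```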

How does this relate to my answer in the other post? The denominator of the above fraction comes from the first stage, i.e. from regressing $X_i$ on $Z_i$: $$ X_i = \delta + \pi Z_i + \eta_i $$

Now you see that we have an expression for $X_i$ as a linear function of the instrument. If you plug this into the very first equation, you get the so-called reduced-form equation: $$ \begin{align} y_i &= \alpha + \beta X_i + \epsilon_i \\ &= \alpha + \beta (\delta + \pi Z_i + \eta_i) + \epsilon_i \\ &= (\alpha + \beta\delta) + \beta \pi Z_i + (\beta\eta_i + \epsilon_i) \end{align} $$

So the ratio of the reduced-form coefficient on $Z_i$ to the first-stage coefficient is indeed $$ \begin{align} \frac{Cov(y_i,Z_i)}{Cov(X_i,Z_i)} &= \frac{\beta\pi \, Var(Z_i)}{\pi \, Var(Z_i)}\\ &= \frac{\beta\pi}{\pi}\\ &= \beta \end{align} $$ the causal effect of interest. This is the maths behind it. I hope that this, together with the other answer, gives you a better intuition for how an instrumental variable can be used to extract "exogenous" variation (under the stated assumptions) from the original $X_i$ to identify the parameter of interest.
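As a final sketch, this "reduced form over first stage" reading can be reproduced with two simple regressions and a ratio of slopes. The data-generating values here (β = 2, π = 0.8) are again assumptions for the example.

```python
# Indirect least squares / Wald-style ratio: slope(y on z) / slope(x on z).
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta, pi = 2.0, 0.8                            # assumed structural and first-stage slopes

u = rng.normal(size=n)                         # unobserved confounder
z = rng.normal(size=n)                         # instrument
x = 0.5 + pi * z + u + rng.normal(size=n)      # first stage: X = delta + pi*Z + eta
y = 1.0 + beta * x + (-1.5 * u + rng.normal(size=n))

def slope(dep, reg):
    """OLS slope from a simple regression of dep on reg (with intercept)."""
    return np.cov(dep, reg)[0, 1] / np.var(reg, ddof=1)

reduced_form = slope(y, z)                     # approximately beta * pi = 1.6
first_stage = slope(x, z)                      # approximately pi = 0.8

print("Reduced-form slope:", reduced_form)
print("First-stage slope: ", first_stage)
print("Ratio:             ", reduced_form / first_stage)   # close to beta = 2.0
```

With additional exogenous covariates the same logic carries through once those covariates are included in both regressions, which is effectively what the first and second stages of 2SLS do in the conventional setup described in the question.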