Causal Interpretation – Understanding Instrumental Variables and Endogeneity in Regression

econometrics, endogeneity, regression

My question is simple. Why do we want to perform instrumental variable estimation?
That may sound silly but let me explain. Suppose we have the following population model:

$ y_{i}= \beta_{0}+\beta_{1}X_{1i}+\beta_{2}X_{2i}+\varepsilon_{i} $

where $E[X_{1i}\varepsilon_{i}] \neq 0$. Therefore $ X_{1} $ is an endogenous variable, which implies that OLS is biased and inconsistent, and that we need an instrument to estimate $ \beta_{1} $ consistently using 2SLS.

My problem is that I do not understand why we would want to find $ \beta_{1} $ in the first place. If endogeneity is present in the model, then $ \beta_{1} $ surely does not capture any causal effect, and if it has no causal interpretation, why would it make sense to try to estimate it at all?

As pointed out in Angrist and Pischke (2008): "a regression inherits its legitimacy from a CEF", which in turn requires that no variables are endogenous. To be more exact, we want the following to be true:

$ E[Y|X]= \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2} \implies \frac{\partial E[Y|X]}{\partial X_{1}} = \beta_{1} $

Thus we would be able to make our usual causal statement without any worries. However, this would definitely not hold if $ X_{1} $ were endogenous. Would $ \beta_{1} $ still have the same interpretation as before? Could we still interpret it causally in the population? If not, why bother with IVs?

Best Answer

The model you wrote down was (I am excluding $X_2$ since it is not obvious that it plays any role in the discussion, but edit your question to clarify if $X_2$ is important to you) $$Y = \beta_0 + \beta_1 X + \varepsilon$$ The confusion, I believe, stems from the fact that the model as written contains ambiguities. For a rather trite example of the ambiguities present if the above equation is stated with no additional clarification, consider the case where we declare arbitrarily that, by definition, $\beta_0 = b_0$, $\beta_1 = b_1$, and $\varepsilon = Y - b_0 - b_1 X$, where $b_0, b_1$ are just random numbers we picked out of a hat. Defined in this way, any data we ever see would be (by our own declaration) consistent with the above model, but clearly such playing around with symbols does not tell us much about the real world.

So, in order to make sense of what the above equation means, we have to be much more specific about the underlying model of the world we are committing to when we write it down. There are two common interpretations in econometrics:

Interpretation 1: CEF

One interpretation of the above equation is that what you really mean is $$E[Y | X] = \beta_0 + \beta_1 X$$ Note that the $\beta_1$ here does not necessarily have a causal interpretation. The concrete question we wish to answer when we posit such a model is: "if I picked a random observation and saw that it had $X = x$, what would my best prediction of $Y$ be?" If this is the interpretation of the linear model, then $\varepsilon$ by definition becomes $Y - E[Y | X]$, and by definition of the CEF, $E[\varepsilon | X] = 0$, and hence $E[X\varepsilon] = E[X E[\varepsilon | X]] = 0$.

This is what Angrist and Pischke (2008) mean when they say that "a regression inherits its legitimacy from a CEF": the OLS regression always has an interpretation as a linear approximation to the CEF. Under some further assumptions about how $X$ is assigned (e.g. if $X$ comes from an experiment), the CEF also happens to have a causal interpretation, but in general we cannot necessarily equate the two. The core challenge of econometrics is precisely mapping out what to do when such a coincidence cannot be justified by our underlying assumptions, which brings us to the second interpretation.
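To make this concrete, here is a minimal numpy sketch. The deliberately nonlinear CEF, the sample size, and all numbers are illustrative choices of mine, not anything from the question; the point is only that, under this interpretation, $E[X\varepsilon] = 0$ holds by construction and OLS delivers the best linear approximation to the CEF whether or not the slope means anything causal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative DGP (my own choice): a deliberately nonlinear CEF, E[Y|X] = exp(X/2).
X = rng.normal(size=n)
Y = np.exp(X / 2) + rng.normal(size=n)

# Under Interpretation 1, epsilon is *defined* as Y - E[Y|X], so E[X * epsilon] = 0:
eps = Y - np.exp(X / 2)
print("sample mean of X * eps:", np.mean(X * eps))        # ~ 0 by construction

# OLS of Y on (1, X) gives the best linear approximation to this (nonlinear) CEF.
Xmat = np.column_stack([np.ones(n), X])
b0, b1 = np.linalg.lstsq(Xmat, Y, rcond=None)[0]
resid = Y - (b0 + b1 * X)
print("OLS slope (linear approx. to the CEF):", b1)
print("sample mean of X * OLS residual:", np.mean(X * resid))  # 0 up to rounding error
```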

Interpretation 2: Causality

Often in econometrics, when we write down a model $$Y = \beta_0 + \beta_1 X + \varepsilon$$ what we really mean is: "if I could intervene and change all my $X$ values by some amount $\Delta X$, then I expect $Y$ to change by $\beta_1\Delta X$". Here the operative word is intervene, and this question is subtly different from the one asked in defining the CEF. Angrist and Pischke (2008) is full of examples of this distinction, so I will not belabor the point here. The key takeaway, though, is that in many ways this question of intervention is much harder to answer. Specifically, by definition, intervention implies that we care about changing the world in a way that potentially makes it different from the world that generated the data. As a result, correlational relationships observed in the data are not necessarily representative of what would happen under the intervention. This is the problem IV aims to correct.
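As a rough sketch of that distinction (entirely made-up numbers, with a confounder $U$ standing in for whatever makes $X$ non-random), compare the slope you would estimate from observational data with what actually happens when you intervene on $X$ in the same structural model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

beta1 = 1.0                                  # true interventional effect of X on Y
U = rng.normal(size=n)                       # unobserved confounder
eps = rng.normal(size=n)
X = U + rng.normal(size=n)                   # X is not randomly assigned: it depends on U
Y = beta1 * X + 2.0 * U + eps                # structural equation for Y

# Question 1 (CEF): what does the observational association look like?
obs_slope = np.cov(X, Y)[0, 1] / np.cov(X, X)[0, 1]
print("observational / CEF slope:", obs_slope)            # ~ 2.0, not 1.0

# Question 2 (intervention): set X := X + 1 for everyone, holding U and eps fixed,
# and push the change through the structural equation.
Y_do = beta1 * (X + 1.0) + 2.0 * U + eps
print("average effect of the intervention:", np.mean(Y_do - Y))  # = beta1 = 1.0
```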

Let me now give an explicit (somewhat simplistic) causal model that would justify the usual IV estimates. (I am personally not a big fan of using Assumption 1 below in actual work, but it is nice for illustrative purposes; read the discussion in Angrist and Pischke (2008) of the LATE interpretation of IV for a somewhat less rigid interpretation of the IV slope.)

Assumption 1: Treatment effects are constant, i.e. for each individual $i$, if they were randomly assigned $X = x$, their outcome $Y_i$ is given by $Y_i = \alpha_i + \beta_1 x$.

Essentially, this assumption says that everyone is affected in exactly the same way by an intervention that changes $X$ by some amount $\Delta X$, but that there is heterogeneity in the baseline level of $Y_i$, as captured by $\alpha_i$.

Assumption 2: $\mathrm{Cov}(X_i,\alpha_i) \neq 0$, but there is an instrument $Z_i$, such that $\mathrm{Cov}(Z_i,\alpha_i) = 0$ and $\mathrm{Cov}(Z_i,X_i)\neq 0$.

The first part of Assumption 2 implies that $X_i$ is not randomly assigned. Let us gain some intuition for what this means using a toy example. Perhaps $Y_i$ is a health outcome measuring how sick person $i$ is (higher $Y_i$ means sicker), so $\alpha_i$ reflects how sick they would be with no medicine ($X_i = 0$). Then we are worried that people who take more medicine ($X_i$ is high) may at baseline be sicker ($\alpha_i$ high). The second and third parts of Assumption 2 imply that there is nonetheless some $Z_i$ which is as good as randomly assigned and which induces some change in $X_i$. In the above example, one could imagine a scenario where $Z_i$ represents randomly giving some people coupons good only for buying the medicine.
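Here is one way to encode this toy story as a simulation. The functional forms, the coefficient on the coupon, and all numbers are illustrative assumptions of mine, not part of the model above; the sketch just checks that the three parts of Assumption 2 hold in this DGP.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

beta1 = -0.5                                    # true effect: medicine reduces illness severity
alpha = rng.normal(loc=5.0, scale=1.0, size=n)  # baseline severity with no medicine
Z = rng.binomial(1, 0.5, size=n)                # coupon, randomly assigned
# Sicker people take more medicine, and the coupon also nudges usage up:
X = 0.8 * alpha + 1.0 * Z + rng.normal(scale=0.5, size=n)
Y = alpha + beta1 * X                           # Assumption 1: constant treatment effect

print("Cov(X, alpha):", np.cov(X, alpha)[0, 1])  # clearly nonzero -> X is endogenous
print("Cov(Z, alpha):", np.cov(Z, alpha)[0, 1])  # ~ 0             -> instrument exogeneity
print("Cov(Z, X):    ", np.cov(Z, X)[0, 1])      # clearly nonzero -> instrument relevance
```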

I also use the following assumption below, solely so that we can explicitly compare the causal model to the CEF model:

Assumption 3: $E[\alpha_i | X_i] = \alpha_0 + \alpha_1 X_i$

Under Assumption 1, we can re-write things in terms of the linear model $$Y_i = \underbrace{\beta_0}_{E[\alpha_i]} + \beta_1 X_i + \underbrace{\varepsilon_i}_{\alpha_i - E[\alpha_i]}$$ But this is, as I alluded to above, not the only way we could have represented our data using a linear model. In particular, Assumption 3 also allows us to write a linear model based on the CEF interpretation, i.e. $$Y_i = \underbrace{\gamma_0}_{\alpha_0} + \underbrace{\gamma_1}_{\beta_1 + \alpha_1} X_i + \underbrace{\delta_i}_{Y_i - \alpha_0 - (\beta_1 + \alpha_1) X_i}$$ Here, the use of $\gamma$'s and $\delta$ is meant to highlight that the intercept, slope, and error are distinct quantities from the intercept, slope, and error in the previous equation.
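The relationship $\gamma_1 = \beta_1 + \alpha_1$ is easy to verify numerically on the same toy DGP as above (again, all numbers are my own illustrative choices; here I compute $\alpha_1$ as the slope of the linear projection of $\alpha_i$ on $X_i$, which is the role Assumption 3 gives it): the OLS/CEF slope picks up the causal slope plus the slope of $\alpha_i$ on $X_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

beta1 = -0.5
alpha = rng.normal(loc=5.0, scale=1.0, size=n)   # baseline severity (the alpha_i)
Z = rng.binomial(1, 0.5, size=n)                 # coupon (plays no role in this comparison)
X = 0.8 * alpha + 1.0 * Z + rng.normal(scale=0.5, size=n)
Y = alpha + beta1 * X                            # Assumption 1

var_X = np.cov(X, X)[0, 1]
gamma1 = np.cov(X, Y)[0, 1] / var_X              # slope of the CEF-style linear model
alpha1 = np.cov(X, alpha)[0, 1] / var_X          # slope of the linear projection of alpha on X

print("gamma1 (OLS slope of Y on X):", gamma1)
print("beta1 + alpha1:              ", beta1 + alpha1)   # matches gamma1
print("beta1 (the causal slope):    ", beta1)            # not what OLS recovers
```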

Superficially, these two models look quite similar, but in content they are quite different. In particular, our $\mathrm{Cov}(X_i,\alpha_i) \neq 0$ assumption translates directly to $E[X_i\varepsilon_i] \neq 0$, while $E[X_i\delta_i] = 0$ by how we defined $\delta_i$. This first and foremost shows that Assumption 2 implies that the CEF does not have a causal interpretation. Translating to the medicine example, when I see that people who take more medicine tend to be sicker ($\gamma_1 > 0$), I do not conclude that the medicine caused the ill health. The point of IV is that, even though the CEF slope no longer tells us what $\beta_1$ is, we can still learn about $\beta_1$ if we go about it in a somewhat more clever way. Specifically, we must instead compare the people who got the drug coupon with the people who did not. Because the coupon is as good as randomly assigned, any difference in average $Y$ between the two groups can only have come about through the difference in medicine use that the coupon induced, so the ratio of those two differences recovers $\beta_1$. If the people who got the coupon were observed to be healthier (which would happen if $\beta_1 < 0$), I could then plausibly conclude that the coupon induced higher medicine use, which helped subjects get better.
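To see this logic in numbers, here is a sketch that fits both the naive OLS slope and the simple IV (Wald) ratio $\mathrm{Cov}(Z_i,Y_i)/\mathrm{Cov}(Z_i,X_i)$ on the same toy data as above; the DGP and all numbers are illustrative assumptions, not estimates from real data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

beta1 = -0.5                                     # true causal effect of the medicine
alpha = rng.normal(loc=5.0, scale=1.0, size=n)   # baseline illness severity
Z = rng.binomial(1, 0.5, size=n)                 # coupon, as good as randomly assigned
X = 0.8 * alpha + 1.0 * Z + rng.normal(scale=0.5, size=n)
Y = alpha + beta1 * X                            # Assumption 1

# Naive OLS / CEF slope: contaminated by Cov(X, alpha) > 0.
ols = np.cov(X, Y)[0, 1] / np.cov(X, X)[0, 1]

# IV (Wald) estimator. With a binary instrument this equals the difference in mean Y
# between coupon and no-coupon groups divided by the difference in mean X.
iv = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]

print("OLS slope:", ols)   # positive: medicine takers look sicker
print("IV slope: ", iv)    # ~ beta1 = -0.5: the medicine actually helps
```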

Finally, again using the medicine example, let me justify why $\gamma_1$ and $\beta_1$ may both have their uses, but for different reasons. When I interpret $\gamma_1$ as the slope of a CEF, I am well equipped to answer the question "if I see the medicine on my friend's counter, should I be worried about their health?", while $\beta_1$ answers the question "if I know my friend is sick but they are not yet taking the medicine, should I recommend they do so?" Clearly, both can be important questions, but as the above discussion shows, even though both can be written as linear models, the superficial similarities hide important interpretational distinctions, which can only be elucidated by writing down your model of the world more explicitly.