Endogeneity in Regression – How Does It Work and Is It Possible?

endogeneity, regression

My question is not particularly straightforward. I will explain my reasoning, and then give the question at the end to avoid any confusion.

$ Y_{i} = \beta_{0}+\beta_{1}X_{i}+\varepsilon_{i} $

Suppose that this is our population regression function. We note that $ \beta_{0} $ and $ \beta_{1} $ are not arbitrary. This means that the error term $ \varepsilon_{i} $ has some structure to it. In particular, it reflects the deviation of $ Y_{i} $ from its expected value. To put it even more clearly, we have that $ E[Y_{i}|X_{i}]= \beta_{0}+\beta_{1}X_{i} $. What I am trying to say here is that our population parameters must take some non-arbitrary values; otherwise we might as well pick them randomly and define our errors to make the equation balance.

Now suppose we want to take a sample and estimate the above population regression function. But before doing so, we want to work out what the true values of $ \beta_{0} $ and $ \beta_{1} $ are. To do this we appeal to the mean squared error as our guide (I drop the subscripts for simplicity of notation):

$ MSE(b_{0}, b_{1}) = E[(Y - (b_{0}+b_{1}X))^2] = E[Y^2] -2b_{0}E[Y]-2b_{1}E[XY]+E[(b_{0}+b_{1}X)^2]$

$ MSE(b_{0}, b_{1}) = E[Y^2] -2b_{0}E[Y]-2b_{1}Cov(X,Y) - 2b_{1}E[X]E[Y]+b_{0}^2+2b_{0}b_{1}E[X]+b_{1}^2Var[X]+b_{1}^2(E[X])^2$

I have skipped some of the algebra because it is not really important at this point. Now I take first order conditions with respect to my two variables:

$ \frac{\partial MSE(b_{0},b_{1})}{\partial b_0}= -2E[Y]+2b_{0}+2b_{1}E[X]$

$ \frac{\partial MSE(b_{0},b_{1})}{\partial b_1}= -2Cov(X,Y)-2E[X]E[Y]+2b_{0}E[X]+2b_{1}Var[X]+2b_{1}(E[X])^2 $

Setting the partial derivatives equal to zero and solving for each parameter gives:

$ \beta_{0}=E[Y]-\beta_{1}E[X] $

$ \beta_{1}= \frac{Cov[X,Y]}{Var[X]} $
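
As a quick numerical sanity check, here is a small sketch in NumPy (the data-generating process, its coefficients, and the sample size are all made up for illustration) showing that these moment formulas agree with an ordinary least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a large "population" so sample moments approximate population moments.
n = 1_000_000
X = rng.normal(2.0, 1.5, size=n)
Y = 1.0 + 0.5 * X + rng.normal(0.0, 1.0, size=n)

# Slope and intercept from the moment formulas derived above.
beta1 = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
beta0 = Y.mean() - beta1 * X.mean()

# Least-squares fit for comparison (it minimizes the same MSE criterion).
beta1_ls, beta0_ls = np.polyfit(X, Y, deg=1)

print(beta0, beta1)        # close to (1.0, 0.5)
print(beta0_ls, beta1_ls)  # agrees up to simulation noise
```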

Now that we have the equations for our true population parameters, we can consider the issue at hand, otherwise known as my question! Let's check the covariance between our explanatory variable and the error term defined above:

$ E[X\varepsilon] = E[X(Y-(\beta_{0}+\beta_{1}X))] = E[XY - X\beta_{0}-\beta_{1}X^2] $

Using some simple algebraic manipulations we get that:

$ E[X\varepsilon]= E[XY]- E[X\beta_{0}+\beta_{1}X^2] $

$ E[X\varepsilon]= E[XY]- E[X(E[Y]-\beta_{1}E[X])+\beta_{1}X^2] $

$ E[X\varepsilon]= E[XY]- E[XE[Y]-X\beta_{1}E[X]+\beta_{1}X^2] $

$ E[X\varepsilon]= E[XY]- E[Y]E[X] + E[X\beta_{1}E[X]] - E[\beta_{1}X^2] $

$ E[X\varepsilon]= E[XY]- E[Y]E[X] + \beta_{1}E[X]^2 - E[\beta_{1}X^2] $

$ E[X\varepsilon]= E[XY]- E[Y]E[X] + \beta_{1}E[X]^2 - \beta_{1}E[X^2] $

$ E[X\varepsilon]= E[XY]- E[Y]E[X] -\beta_{1}(E[X^2] - E[X]^2) $

$ E[X\varepsilon]= Cov[X,Y]-\beta_{1}Var[X] $

$ E[X\varepsilon]= Cov[X,Y]-\frac{Cov[X,Y]}{Var[X]}Var[X] $

$ E[X\varepsilon]= 0 $

What we have proved is that, by construction, the parameters of the population regression function are defined in such a way that the covariance between the explanatory variable and the errors is equal to zero.
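
To make this concrete, here is a short simulation sketch (the confounded data-generating process is invented for illustration): even when an omitted variable drives both $X$ and $Y$, the error defined by the projection coefficients above remains orthogonal to $X$ by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Z is an omitted variable that drives both X and Y.
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)
Y = 2.0 * X + 3.0 * Z + rng.normal(size=n)   # structural effect of X is 2.0

# Projection coefficients from the formulas above.
beta1 = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
beta0 = Y.mean() - beta1 * X.mean()

eps = Y - (beta0 + beta1 * X)
print(beta1)             # about 3.5, not the structural 2.0
print(np.mean(X * eps))  # about 0 by construction, despite the omitted Z
```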

Bearing all of the above in mind, herein lies my question: how is endogeneity mathematically possible? Is my interpretation of the regression wrong? I have heard others mention that there is some sort of causal interpretation of regression, though I am not really sure what that means. Any insights?

P.S. Thank you to those that have answered my other questions about regression analysis previously, I am learning a lot!

Best Answer

You are correct: in the linear regression, $\mathbb{E}[X\varepsilon] = 0$ by construction. I talked about this in another post here:

https://stats.stackexchange.com/a/550305/261146

Defining endogeneity requires additional structure.

  • You need an "externally defined" quantity of interest.
  • Based on this external definition you decide whether OLS satisfies exogeneity.

1. An illustrative example

I think the clearest example is selection into treatment, the "causal case".

Suppose that an individual is assigned to a treatment $D \in \{0,1\}$. If the individual is part of the control group, she will have an outcome $Y_0$. If she is treated, her outcome is $Y_1$. The outcome observed by the researcher is: \begin{align*} Y &= DY_1 + (1-D)Y_0 \\ &= Y_0 + D(Y_1 - Y_0) \end{align*}

The variables $(Y_1,Y_0)$ are known as "potential outcomes". An individual cannot belong to both groups at the same time, which means that one of the two outcomes is never observed. The best the researcher can do is find two groups with similar characteristics but different treatment status and compare their outcomes. The average treatment effect is defined as $$ \tau = \mathbb{E}[Y_1 - Y_0] $$ This quantity is important in medical trials and many other settings because it measures the population-level effect of the treatment.

Can we identify $\tau$ from an OLS regression of $Y$ on $D$?

2. Running an OLS estimator

The OLS regression takes the form: $$ Y = \mu_{OLS} + \tau_{OLS} D + \varepsilon$$

The coefficients $(\mu_{OLS},\tau_{OLS})$ are constructed to minimize the MSE, and will satisfy the exogeneity condition by construction.

The OLS coefficient of $\tau$ is equal to \begin{align*} \tau_{OLS} &= \frac{\mathbb{E}[DY]}{\mathbb{E}[D]} - \frac{\mathbb{E}[(1-D)Y]}{\mathbb{E}[1-D]} \end{align*}
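
To see why, apply the slope formula derived in the question, $\tau_{OLS} = \mathrm{Cov}(D,Y)/\mathrm{Var}(D)$, writing $p = \mathbb{E}[D]$ and using $\mathrm{Var}(D) = p(1-p)$ and $\mathbb{E}[Y] = \mathbb{E}[DY] + \mathbb{E}[(1-D)Y]$ for binary $D$:

\begin{align*} \tau_{OLS} = \frac{\mathrm{Cov}(D,Y)}{\mathrm{Var}(D)} = \frac{\mathbb{E}[DY] - p\,\mathbb{E}[Y]}{p(1-p)} = \frac{(1-p)\,\mathbb{E}[DY] - p\,\mathbb{E}[(1-D)Y]}{p(1-p)} = \frac{\mathbb{E}[DY]}{\mathbb{E}[D]} - \frac{\mathbb{E}[(1-D)Y]}{\mathbb{E}[1-D]} \end{align*}

That is, with a binary regressor the OLS slope is simply the difference in group means.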

3. Does OLS work, i.e. $\tau_{OLS} = \tau$?

  • By substituting the definition of $Y$ with potential outcomes, this is equal to $$ \tau_{OLS} = \frac{\mathbb{E}[DY_1]}{\mathbb{E}[D]} - \frac{\mathbb{E}[(1-D)Y_0]}{\mathbb{E}[1-D]} $$

    Case 1: (Random assignment) When $D$ is randomly assigned, $D$ is independent of $(Y_1,Y_0)$. This means that $\mathbb{E}[DY_1] = \mathbb{E}[D]\mathbb{E}[Y_1]$ and $\mathbb{E}[DY_0] = \mathbb{E}[D]\mathbb{E}[Y_0]$. We can then readily verify that $\tau_{OLS} = \mathbb{E}[Y_1] - \mathbb{E}[Y_0] = \tau$.

    Case 2: (Self-selection) Suppose that $D = 1$ if and only if $Y_1 > Y_0$. This is a situation where individuals only participate if they (correctly) assess that they will receive some benefit. In this case the expectations no longer factor, and in general $\tau_{OLS} \ne \tau$; the simulation sketch below illustrates this.
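
Here is a minimal simulation sketch of both cases (the potential-outcome distributions and sample size are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Potential outcomes; the treatment effect varies across individuals.
Y0 = rng.normal(0.0, 1.0, size=n)
Y1 = Y0 + rng.normal(1.0, 2.0, size=n)
tau = (Y1 - Y0).mean()                  # true ATE, close to 1.0

def tau_ols(D, Y):
    """OLS slope of Y on binary D = difference in group means."""
    return Y[D == 1].mean() - Y[D == 0].mean()

# Case 1: random assignment, independent of (Y1, Y0).
D = rng.integers(0, 2, size=n)
print(tau, tau_ols(D, np.where(D == 1, Y1, Y0)))  # approximately equal

# Case 2: self-selection, D = 1 iff the individual benefits.
D = (Y1 > Y0).astype(int)
print(tau, tau_ols(D, np.where(D == 1, Y1, Y0)))  # tau_OLS overstates tau
```

Under self-selection the treated group consists precisely of the individuals with positive gains, so the difference in group means picks up the selection effect on top of $\tau$.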

4. Insights

  • We can always write the potential outcomes as a linear model:

$$ Y = \mu + \tau D + U $$ where $U = (Y_0 - \mu) + D(Y_1 - Y_0 - \tau)$ and $\mu = \mathbb{E}[Y_0]$.

  • The exogeneity condition in this case requires that

\begin{align*} \mathbb{E}[ D U ] &= \mathbb{E}[D(Y_0 - \mu)] + \mathbb{E}[D(Y_1 - Y_0 - \tau)] = 0 \qquad (\text{using } D^2 = D) \\ \mathbb{E}[ U ] &= \mathbb{E}[Y_0 - \mu + D(Y_1 - Y_0 - \tau)] = 0 \\ \end{align*}

  • This may fail if $D$ is correlated with $(Y_1,Y_0)$; the numerical sketch at the end of this answer demonstrates the failure under self-selection.
  • OLS minimizes the MSE criterion; it is not guaranteed to produce anything else.
  • We sometimes call the parameters $(\mu_{OLS},\tau_{OLS})$ "pseudo-parameters" because they solve the equation $\mathbb{E}[X(Y - \mu_{OLS} - D \tau_{OLS})] = 0$, where $X = (1,D)$.
  • However, depending on the "external" model, these may or may not have a "useful" interpretation.
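
As a final check, here is a short sketch (reusing the same invented distributions as in the earlier simulation) that computes $\mathbb{E}[DU]$ under both assignment mechanisms:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

Y0 = rng.normal(0.0, 1.0, size=n)
Y1 = Y0 + rng.normal(1.0, 2.0, size=n)
mu, tau = Y0.mean(), (Y1 - Y0).mean()    # mu = E[Y0], tau = E[Y1 - Y0]

for label, D in [("random", rng.integers(0, 2, size=n)),
                 ("self-selected", (Y1 > Y0).astype(int))]:
    U = (Y0 - mu) + D * (Y1 - Y0 - tau)  # error term of the linear model above
    print(label, np.mean(D * U))         # ~0 if random; clearly nonzero if not
```

Under self-selection $\mathbb{E}[DU] > 0$: individuals with above-average gains select into treatment, so the OLS slope absorbs the selection effect rather than identifying $\tau$.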