My question is not particularly straightforward. I will explain my reasoning, and then give the question at the end to avoid any confusion.
$\ Y_{i} = \beta_{0}+\beta_{1}X_{i}+\varepsilon_{i} $
Suppose that this is our population regression function. We note that $\beta_{0}$ and $\beta_{1}$ are not arbitrary. This means that the error term $\varepsilon_{i}$ has some structure to it: it reflects the deviation of $Y_{i}$ from its expected value. To put it even more clearly, we have that $E[Y_{i}|X_{i}] = \beta_{0}+\beta_{1}X_{i}$. What I am trying to say here is that our population parameters must take some non-arbitrary value; otherwise we might as well pick them randomly and define our errors to make the equation balance.
Now suppose we want to take a sample and estimate the above population regression function. But before doing so, we want to work out what the true value of $ \beta_{0} $ and $ \beta_{1} $ are. To do this we appeal to the mean square error as our guide (I drop the subscripts for simplicity of notation):
$ MSE(b_{0}, b_{1}) = E[(Y - (b_{0}+b_{1}X))^2] = E[Y^2] -2b_{0}E[Y]-2b_{1}E[XY]+E[(b_{0}+b_{1}X)^2]$
$ MSE(b_{0}, b_{1}) = E[Y^2] -2b_{0}E[Y]-2b_{1}Cov(X,Y) - 2b_{1}E[X]E[Y]+b_{0}^2+2b_{0}b_{1}E[X]+b_{1}^2Var[X]+b_{1}^2(E[X])^2$
I have skipped some of the algebra because it is not really important at this point. Now I take first order conditions with respect to my two variables:
$ \frac{\partial MSE(b_{0},b_{1})}{\partial b_0}= -2E[Y]+2b_{0}+2b_{1}E[X]$
$ \frac{\partial MSE(b_{0},b_{1})}{\partial b_1}= -2Cov(X,Y)-2E[X]E[Y]+2b_{0}E[X]+2b_{1}Var[X]+2b_{1}(E[X])^2 $
Setting the partial derivatives equal to zero and solving for each parameter gives:
$ \beta_{0}=E[Y]-\beta_{1}E[X] $
$ \beta_{1}= \frac{Cov[X,Y]}{Var[X]} $
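As a sanity check on this derivation, here is a small numerical sketch (NumPy assumed; the data-generating coefficients and seed are arbitrary choices for illustration) showing that the closed-form $(b_0, b_1)$ really do minimize the sample MSE:

```python
import numpy as np

rng = np.random.default_rng(0)

# A large simulated sample stands in for the population moments.
# True parameters (hypothetical, for illustration): beta0 = 1.0, beta1 = 0.5.
X = rng.normal(2.0, 1.5, size=1_000_000)
Y = 1.0 + 0.5 * X + rng.normal(size=X.size)

# The closed-form minimizers derived above.
b1 = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
b0 = Y.mean() - b1 * X.mean()

def mse(c0, c1):
    """Sample analogue of E[(Y - (c0 + c1*X))^2]."""
    return np.mean((Y - (c0 + c1 * X)) ** 2)

# Nudging either coefficient away from (b0, b1) can only raise the MSE.
print(b0, b1)                             # close to (1.0, 0.5)
print(mse(b0, b1) <= mse(b0 + 0.1, b1))   # True
print(mse(b0, b1) <= mse(b0, b1 + 0.1))   # True
```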
Now that we have the equations of our true population parameters, we can consider the issue at hand, otherwise known as my question! Let's try to check the correlation between our explanatory variables and the error term defined above:
$ E[X\varepsilon] = E[X(Y-(\beta_{0}+\beta_{1}X))] = E[XY - X\beta_{0}-\beta_{1}X^2] $
Using some simple algebraic manipulations we get that:
$ E[X\varepsilon]= E[XY]- E[X\beta_{0}+\beta_{1}X^2] $
$ E[X\varepsilon]= E[XY]- E[X(E[Y]-\beta_{1}E[X])+\beta_{1}X^2] $
$ E[X\varepsilon]= E[XY]- E[XE[Y]-X\beta_{1}E[X]+\beta_{1}X^2] $
$ E[X\varepsilon]= E[XY]- E[Y]E[X] + E[X\beta_{1}E[X]] - E[\beta_{1}X^2] $
$ E[X\varepsilon]= E[XY]- E[Y]E[X] + \beta_{1}E[X]^2 - E[\beta_{1}X^2] $
$ E[X\varepsilon]= E[XY]- E[Y]E[X] + \beta_{1}E[X]^2 - \beta_{1}E[X^2] $
$ E[X\varepsilon]= E[XY]- E[Y]E[X] -\beta_{1}(E[X^2] - E[X]^2) $
$ E[X\varepsilon]= Cov[X,Y]-\beta_{1}Var[X] $
$ E[X\varepsilon]= Cov[X,Y]-\frac{Cov[X,Y]}{Var[X]}Var[X] $
$ E[X\varepsilon]= 0 $
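This orthogonality holds no matter how the data were actually generated, even when there is an omitted confounder. A sketch (NumPy assumed; the confounder $Z$ and all coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Data-generating process with an omitted confounder Z (all coefficients
# are made up): X depends on Z, and Y depends on both X and Z.
Z = rng.normal(size=n)
X = 0.8 * Z + rng.normal(size=n)
Y = 1.0 + 0.5 * X + 2.0 * Z + rng.normal(size=n)

# Population-projection coefficients, i.e. the minimizers derived above.
beta1 = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
beta0 = Y.mean() - beta1 * X.mean()

# The projection error is orthogonal to X no matter how Y was generated ...
eps = Y - (beta0 + beta1 * X)
print(np.mean(X * eps))   # ~ 0 by construction

# ... but the projection slope is not the structural coefficient 0.5:
print(beta1)              # noticeably larger, because it absorbs Z
```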
What we have proved is that, by construction, the parameters of the population regression function are defined in such a way that the covariance between the explanatory variable and the error term is zero (since $E[\varepsilon]=0$ also follows from the first-order condition for $b_0$, $E[X\varepsilon]=0$ implies $Cov[X,\varepsilon]=0$).
Now bearing all of the above in mind, herein lies my question: how is endogeneity mathematically possible? Is it that my interpretation of the regression is wrong? I have heard others mention that there is some sort of causal interpretation of regression that must be taken, though I am not really sure what that means, or what that is. Any insights?
P.S. Thank you to those that have answered my other questions about regression analysis previously, I am learning a lot!
Best Answer
You are correct: in a linear regression, $\mathbb{E}[X\varepsilon] = 0$ holds by construction. I discussed this in another post here
https://stats.stackexchange.com/a/550305/261146
Defining endogeneity requires additional structure.
1. An illustrative example
I think the clearest example is selection into treatment, the "causal case".
Suppose that an individual is assigned to a treatment $D \in \{0,1\}$. If the individual is part of the control group, her outcome is $Y_0$; if she is treated, her outcome is $Y_1$. The outcome observed by the researcher is: \begin{align*} Y &= DY_1 + (1-D)Y_0 \\ &= Y_0 + D(Y_1 - Y_0) \end{align*}The variables $(Y_1,Y_0)$ are known as "potential outcomes". An individual cannot belong to both groups at the same time, so one of the two outcomes is never observed. The best the researcher can do is find two groups with similar characteristics but different treatment status and compare their outcomes. The average treatment effect is defined as $$ \tau = \mathbb{E}[Y_1 - Y_0] $$ This quantity matters in medical trials and many other settings because it measures the population-level effect of the treatment.
Can we identify $\tau$ from an OLS regression of $Y$ on $D$?
2. Running an OLS regression
The OLS regression takes the form: $$ Y = \mu_{OLS} + \tau_{OLS} D + \varepsilon$$
The coefficients $(\mu_{OLS},\tau_{OLS})$ are constructed to minimize the MSE, exactly as in your derivation, and therefore satisfy the exogeneity condition $\mathbb{E}[D\varepsilon]=0$.
Because $D$ is binary, the OLS coefficient $\tau_{OLS}$ is equal to \begin{align*} \tau_{OLS} &= \frac{\mathbb{E}[DY]}{\mathbb{E}[D]} - \frac{\mathbb{E}[(1-D)Y]}{\mathbb{E}[1-D]} \end{align*}
3. Does OLS work, i.e. $\tau_{OLS} = \tau$?
Substituting the potential-outcomes expression for $Y$ (and using $D^2 = D$ and $(1-D)D = 0$), this becomes $$ \tau_{OLS} = \frac{\mathbb{E}[DY_1]}{\mathbb{E}[D]} - \frac{\mathbb{E}[(1-D)Y_0]}{\mathbb{E}[1-D]} = \mathbb{E}[Y_1 \mid D=1] - \mathbb{E}[Y_0 \mid D=0] $$ Case 1: (Random assignment) When $D$ is randomly assigned, $D$ is independent of $(Y_1,Y_0)$. This means that $\mathbb{E}[DY_1] = \mathbb{E}[D]\mathbb{E}[Y_1]$ and $\mathbb{E}[(1-D)Y_0] = \mathbb{E}[1-D]\mathbb{E}[Y_0]$. We can then readily verify that $\tau_{OLS} = \mathbb{E}[Y_1] - \mathbb{E}[Y_0] = \tau$.
Case 2: (Self-selection) Suppose that $D = 1$ if and only if $Y_1 > Y_0$. This is a situation where individuals only participate if they (correctly) assess that they will benefit from treatment. Now $D$ is correlated with $(Y_1, Y_0)$, so in general $\mathbb{E}[Y_1 \mid D=1] \ne \mathbb{E}[Y_1]$ and $\mathbb{E}[Y_0 \mid D=0] \ne \mathbb{E}[Y_0]$; the terms no longer cancel and $\tau_{OLS} \ne \tau$.
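Both cases can be simulated directly (a NumPy sketch with a made-up potential-outcomes model; the true ATE is set to zero so any gap is pure selection bias):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Made-up potential outcomes with a true ATE of zero:
# Y1 = Y0 + noise, so tau = E[Y1 - Y0] = 0.
Y0 = rng.normal(size=n)
Y1 = Y0 + rng.normal(size=n)
tau = (Y1 - Y0).mean()

def tau_ols(D):
    """Difference in observed group means, i.e. the OLS slope on binary D."""
    Y = D * Y1 + (1 - D) * Y0
    return Y[D == 1].mean() - Y[D == 0].mean()

# Case 1: random assignment recovers tau.
D_rand = rng.integers(0, 2, size=n)
print(tau, tau_ols(D_rand))   # both ~ 0

# Case 2: self-selection (D = 1 iff Y1 > Y0) does not.
D_self = (Y1 > Y0).astype(int)
print(tau_ols(D_self))        # clearly positive
```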
4. Insights
We can always rewrite the observed outcome in regression form: $$ Y = \mu + \tau D + U $$ where $\mu = \mathbb{E}[Y_0]$ and $U = (Y_0 - \mu) + D(Y_1 - Y_0 - \tau)$. Here $\tau$ is the structural (causal) parameter, not the projection coefficient. Using $D^2 = D$:
\begin{align*} \mathbb{E}[ U ] &= \mathbb{E}[Y_0 - \mu] + \mathbb{E}[D(Y_1 - Y_0 - \tau)] \\ \mathbb{E}[ D U ] &= \mathbb{E}[D(Y_0 - \mu)] + \mathbb{E}[D(Y_1 - Y_0 - \tau)] \\ \end{align*}
Under random assignment both right-hand sides are zero, the structural error satisfies the exogeneity condition, and $\tau_{OLS} = \tau$. Under self-selection, $D$ is correlated with $(Y_1, Y_0)$, so in general $\mathbb{E}[DU] \ne 0$. That is endogeneity: the error of the *causal* model is correlated with the regressor. Your derivation is correct, but it concerns the projection error $\varepsilon$, which is uncorrelated with $X$ by construction; the structural error $U$ is a different object, and nothing forces it to be uncorrelated with $D$. Endogeneity is a statement about the structural model, and it manifests as $\tau_{OLS} \ne \tau$.
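These orthogonality conditions for the structural error hold under random assignment but fail under self-selection, which can be checked numerically (NumPy sketch with a made-up potential-outcomes model):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Made-up potential-outcomes model (true ATE is roughly zero).
Y0 = rng.normal(size=n)
Y1 = Y0 + rng.normal(size=n)
mu, tau = Y0.mean(), (Y1 - Y0).mean()

def E_DU(D):
    """Sample analogue of E[D*U] for the structural error U."""
    U = (Y0 - mu) + D * (Y1 - Y0 - tau)
    return np.mean(D * U)

D_rand = rng.integers(0, 2, size=n)     # random assignment
D_self = (Y1 > Y0).astype(int)          # self-selection

print(E_DU(D_rand))   # ~ 0: the structural error is exogenous
print(E_DU(D_self))   # clearly nonzero: endogeneity
```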