The short answer is yes, $\mathbb{E}[X_i\varepsilon_i] = 0$ by construction.
A few important remarks:
- However, this does not imply that $\mathbb{E}[\varepsilon_i \mid X_i] = 0$.
- The conditional mean being equal to zero is a much stronger condition.
- Your logic fails because you are assuming that conditional means are always linear.
- What about endogeneity?
- Statistically, the OLS $\beta$ always satisfies $\mathbb{E}[X_i\varepsilon_i] = 0$.
- Modern theory calls this a "pseudo-parameter" that "mechanically" satisfies the restriction.
- There are many examples where we can formally define an external notion of a "true" $\beta^*$ and show that things like sample selection, omitted variables, etc., can lead to situations where $\beta \ne \beta^*$. This requires adding more structure to the problem to even define the bias.
The way that you are looking at the problem is slightly circular. Let us break down the problem by looking at some definitions.
Setting up the problem
The OLS estimator in matrix form is $\hat{\beta} = (X'X)^{-1}X'Y$. It can also be written as follows:
$$ \hat{\beta} = \left( \frac{1}{n} \sum_{i=1}^n X_i X_i'\right)^{-1}\left( \frac{1}{n} \sum_{i=1}^n X_i Y_i \right) $$
The OLS estimator is the sample analogue of the following population quantity:
$$ \beta = \mathbb{E}[X_iX_i']^{-1}\mathbb{E}[X_iY_i] $$
This definition is convenient because it doesn't introduce any assumptions about the error terms. It just says that $\beta$ is a function of two means.
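To make the word "approximation" precise: this $\beta$ is the coefficient of the best linear predictor of $Y_i$ given $X_i$, i.e. the solution to
$$ \min_b \; \mathbb{E}\left[(Y_i - X_i'b)^2\right] $$
whose first-order condition $\mathbb{E}[X_i(Y_i - X_i'\beta)] = 0$ rearranges to the formula above.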
Why is $\mathbb{E}[X_i\varepsilon_i] = 0$ by construction?
Now let us define the error terms as
$$ \varepsilon_i = Y_i - X_i'\beta $$
Then we can verify that
\begin{align*} \mathbb{E}[X_i\varepsilon_i] &= \mathbb{E}[X_iY_i-X_iX_i' \beta] \\ &=\mathbb{E}[X_iY_i] - \mathbb{E}[X_iX_i'] \beta
\end{align*}
Plug in the definition of $\beta$ above and you can easily check that $\mathbb{E}[X_i\varepsilon_i] = 0$. A similar result holds in finite samples for $\hat{\beta}$. In particular,
$$ \frac{1}{n}\sum_{i=1}^n X_i(Y_i - X_i'\hat{\beta}) = \left(\frac{1}{n}\sum_{i=1}^n X_iY_i\right) - \left(\frac{1}{n}\sum_{i=1}^n X_iX_i' \right)\hat{\beta} = 0$$
In matrix form, with $\hat{\varepsilon} = Y - X\hat{\beta}$, this says $\frac{X'\hat{\varepsilon}}{n} = 0$.
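To see the "by construction" point numerically, here is a minimal simulation sketch (the data-generating process and variable names are arbitrary choices of mine): the sample orthogonality holds for any $(Y_i, X_i)$ you feed in, because it is a property of the least-squares algebra rather than of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Arbitrary data: no model is assumed, and Y is deliberately nonlinear in X.
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + one regressor
Y = np.exp(X[:, 1]) + rng.uniform(size=n)

# OLS: beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat

# Orthogonality holds mechanically, up to floating-point error:
print(X.T @ resid / n)  # ~ [0. 0.]
```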
Why may $\mathbb{E}[\varepsilon_i \mid X_i]$ not equal zero?
So far we have not made any assumptions about the dependence between $(Y_i,X_i)$. Now consider a general model where
$$Y_i = g(X_i) + U_i, \qquad \mathbb{E}[U_i \mid X_i] = 0$$
where $U_i$ is a new variable that we are introducing. The function $g(X_i)$ is the conditional mean of $Y_i$ given $X_i$. Now let us apply the definition of the OLS error to this model:
\begin{align*}
\varepsilon_i &= Y_i - X_i'\beta \\
&= g(X_i) + U_i - X_i'\beta \\
&= [g(X_i) - X_i'\beta] + U_i
\end{align*}
Now let us compute the conditional expectation
\begin{align*}
\mathbb{E}[\varepsilon_i \mid X_i] &= \mathbb{E}[g(X_i) - X_i'\beta \mid X_i] + \mathbb{E}[U_i \mid X_i] \\
&= g(X_i) - X_i'\beta
\end{align*}
In the second line we use the fact that the conditional mean of $U_i$ is zero and that $g(X_i) - X_i'\beta$ is a function of $X_i$. Once you state it in this form it is easy to find counterexamples where the conditional mean is non-zero, regardless of the value of $\beta$. For instance, assume that $g(X_i) = X_i^2$ and $X_i \sim \mathcal{N}(0,1)$: projecting on a constant and $X_i$ gives $\beta_0 = \mathbb{E}[X_i^2] = 1$ and $\beta_1 = \mathbb{E}[X_i^3] = 0$, so $\mathbb{E}[\varepsilon_i \mid X_i] = X_i^2 - 1 \neq 0$ (a simulation sketch follows the list below).
- In this particular case, applying the law of iterated expectations shows that the OLS coefficient can be written as $\beta = \mathbb{E}[X_iX_i']^{-1}\mathbb{E}[X_ig(X_i)]$.
- As you can see, everything is internally consistent.
- In a nonlinear model like this, $X_i'\beta$ is called the "best linear projection" of $Y_i$: it minimizes the mean squared error $\mathbb{E}[(Y_i - X_i'b)^2]$ among all linear functions of $X_i$.
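Here is the simulation sketch promised above (a minimal illustration with made-up data; the variable names are mine): the sample analogue of $\mathbb{E}[X_i\hat{\varepsilon}_i]$ is zero by construction, yet the average residual clearly varies with $X_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)          # X ~ N(0, 1)
y = x**2 + rng.normal(size=n)   # g(X) = X^2, with E[U | X] = 0

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # ~ (1, 0): beta_0 = E[X^2], beta_1 = E[X^3]
resid = y - X @ beta_hat

print(X.T @ resid / n)          # ~ [0, 0]: E[X eps] = 0 by construction

# ...but E[eps | X] = g(X) - X'beta = X^2 - 1 is not zero: residuals are
# negative on average near the center of X and positive in the tails.
for lo, hi in [(-3.0, -1.0), (-1.0, 1.0), (1.0, 3.0)]:
    mask = (x > lo) & (x < hi)
    print(f"mean residual for X in ({lo}, {hi}): {resid[mask].mean():+.3f}")
```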
The model you wrote down was (I am excluding $X_2$ since it is not obvious that it plays any role in the discussion, but modify your question to clarify if $X_2$ is important to you)
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
The confusion, I believe, stems from the fact that the model as written contains ambiguities. For a rather trite example of the ambiguities present when the above equation is stated with no additional clarification, consider the case where we declare arbitrarily that, by definition, $\beta_0 = b_0$, $\beta_1 = b_1$, and $\varepsilon = Y - b_0 - b_1 X$, where $b_0, b_1$ are just random numbers we picked out of a hat. Defined in this way, any data we ever see would be (by our own declaration) consistent with the above model, but clearly, such playing around with symbols does not tell us much about the real world. So in order to make sense of what the above equation means, we have to be much more specific about the underlying model of the world we are committing to in deriving it. There are two common interpretations in econometrics.
Interpretation 1: CEF
One interpretation of the above equation is that what you really mean by this is that
$$E[Y | X] = \beta_0 + \beta_1 X$$
Note that the $\beta_1$ here does not necessarily have a causal interpretation. The concrete question we wish to answer when we posit such a model is "if I picked a random observation and saw that it had $X = x$, what would my best prediction of $Y$ be?" If this is the interpretation of the linear model, then $\varepsilon$ by definition becomes $Y - E[Y | X]$, and by definition of the CEF, $E[\varepsilon | X] = 0$, and hence $E[X\varepsilon] = E[X E[\varepsilon | X]] = 0$. This is what Angrist and Pischke (2008) mean when they say that "a regression inherits its legitimacy from a CEF": the OLS regression always has the interpretation of a linear approximation to the CEF. It happens to be the case that under some further assumptions about how $X$ is assigned (e.g. if $X$ comes from an experiment), the CEF also happens to have a causal interpretation, but in general, we cannot necessarily equate the two. The core challenge of econometrics is precisely trying to map out what to do when such a coincidence cannot be justified by our underlying assumptions, which brings us to the second interpretation.
Interpretation 2: Causality
Often in econometrics, when we write down a model
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
what we really mean is: "if I could intervene and change all my $X$ values by some amount $\Delta X$, then I expect $Y$ to change by $\beta_1\Delta X$". Here, the operative word is intervene, and this question is subtly different from the question asked in defining the CEF. Angrist and Pischke (2008) is full of examples of this distinction, so I will not belabor the point here. The key takeaway, though, is that in many ways this question of intervention is much harder to answer. Specifically, by definition, intervention implies that we care about changing the world in a way that potentially makes it different from the world that generated the data. As a result, correlational relationships observed in the data are not necessarily representative of what would happen given the intervention. This is the problem IV aims to correct.
Let me now give an explicit (somewhat simplistic) causal model that would justify the usual IV estimates (I personally am not a big fan of using Assumption 1 below in actual work, but it is nice for illustrative purposes. Read Angrist and Pischke (2008)'s discussion about the LATE interpretation of IV for a somewhat less rigid interpretation of the IV slope).
Assumption 1: Treatment effects are constant, i.e. for each individual $i$, if they were randomly assigned $X = x$, their outcome $Y_i$ is given by $Y_i = \alpha_i + \beta_1 x$.
Essentially, this assumption says that everyone is affected in exactly the same way by an intervention that changes $X$ by some amount $\Delta X$, but there is heterogeneity in the baseline level of $Y_i$, as captured by $\alpha_i$.
Assumption 2: $\mathrm{Cov}(X_i,\alpha_i) \neq 0$, but there is an instrument $Z_i$, such that $\mathrm{Cov}(Z_i,\alpha_i) = 0$ and $\mathrm{Cov}(Z_i,X_i)\neq 0$.
The first part of Assumption 2 implies that $X_i$ is not randomly assigned. Let us gain some intuition for what this means using a toy example. Perhaps $Y_i$ is a health outcome, say an illness severity score (higher means sicker), so $\alpha_i$ reflects baseline illness with no medicine ($X_i = 0$). Then we are worried that people who take more medicine ($X_i$ is high) may be sicker at baseline ($\alpha_i$ high). The second and third parts of Assumption 2 imply that there is nonetheless some $Z_i$ which is as good as randomly assigned and induces some change in $X_i$. In the above example, one could imagine a scenario where $Z_i$ represents randomly giving some people coupons good only for buying the medicine.
I also use the following assumption below solely for the purposes of being able to explicitly compare the causal model to the CEF model:
Assumption 3: $E[\alpha_i | X_i] = \alpha_0 + \alpha_1 X_i$
Under Assumption 1, we can re-write things in terms of the linear model
$$Y_i = \underbrace{\beta_0}_{E[\alpha_i]} + \beta_1 X_i + \underbrace{\varepsilon_i}_{\alpha_i - E[\alpha_i]}$$
But this is, as I alluded to above, not the only way we could have represented our data using a linear model. In particular, Assumption 3 also allows us to write a linear model based on the CEF interpretation, i.e.
$$Y_i = \underbrace{\gamma_0}_{\alpha_0} + \underbrace{\gamma_1}_{\beta_1 + \alpha_1} X_i + \underbrace{\delta_i}_{Y_i - \alpha_0 - (\beta_1 + \alpha_1) X_i}$$
Here, the use of $\gamma$'s and $\delta$ is meant to highlight that the intercept, slope, and error are distinct quantities from the intercept, slope, and error in the previous equation.
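To make the mapping explicit, the $\gamma$'s follow from Assumptions 1 and 3 in one step:
\begin{align*}
E[Y_i \mid X_i] &= E[\alpha_i \mid X_i] + \beta_1 X_i = \alpha_0 + (\beta_1 + \alpha_1) X_i
\end{align*}
so $\gamma_0 = \alpha_0$, $\gamma_1 = \beta_1 + \alpha_1$, and $\delta_i = Y_i - E[Y_i \mid X_i]$ satisfies $E[\delta_i \mid X_i] = 0$ by construction.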
Superficially, these two models look quite similar, but in content, they are quite different. In particular, our $\mathrm{Cov}(X_i,\alpha_i) \neq 0$ assumption translates directly to $E[X_i\varepsilon_i] \neq 0$, while $E[X_i\delta_i] = 0$ by how we defined $\delta_i$. This first and foremost shows that Assumption 2 implies that the CEF does not have a causal interpretation. Translating to the medicine example, when I see that people who take more medicine tend to be sicker ($\gamma_1 > 0$), I do not conclude that the medicine caused ill health. The point of IV is that despite the fact that the CEF slope no longer tells us what $\beta_1$ is, we can still learn about $\beta_1$ if we go about it in a somewhat more clever way. Specifically, what we must instead do is to look at the people who got the drug coupon and people who did not get the drug coupon. If the people who got the coupon were observed to be healthier (which would happen if $\beta_1 < 0$), I could then plausibly conclude that the coupon induced higher medicine use, which helped subjects get better.
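To tie the threads together, here is a small simulation sketch of the medicine example (all numerical choices below, such as $\beta_1 = -0.5$ and the variances, are hypothetical ones I picked for illustration). OLS on $(Y_i, X_i)$ recovers the CEF slope $\gamma_1 = \beta_1 + \alpha_1$, which is positive here because of selection, while the IV (Wald) ratio $\mathrm{Cov}(Z_i,Y_i)/\mathrm{Cov}(Z_i,X_i)$ recovers the causal $\beta_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

beta1 = -0.5                         # causal effect: medicine reduces illness severity

alpha = 2.0 * rng.normal(size=n)     # heterogeneous baseline illness (Assumption 1)
z = rng.normal(size=n)               # coupon: independent of alpha (Assumption 2)
x = z + alpha + rng.normal(size=n)   # sicker people take more medicine: Cov(X, alpha) > 0
y = alpha + beta1 * x                # Y_i = alpha_i + beta_1 X_i

# Joint normality makes E[alpha | X] linear, so Assumption 3 holds with
# alpha_1 = Cov(alpha, X) / Var(X) = 4/6, hence gamma_1 = beta_1 + alpha_1 = 1/6.
cxy = np.cov(x, y)
print("OLS slope (CEF):   %+.3f" % (cxy[0, 1] / cxy[0, 0]))   # ~ +0.167: sicker

# The Wald / IV ratio recovers the causal slope beta_1.
print("IV slope (causal): %+.3f" % (np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]))  # ~ -0.500
```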
Finally, again using the medicine example, let me now justify why $\gamma_1$ and $\beta_1$ may both have their uses, but for different reasons. When I am interpreting $\gamma_1$ as the slope of a CEF, I am well equipped to answer the question "if I see the medicine on my friend's counter, should I be worried about their health?", while $\beta_1$ answers the question "if I know my friend is sick but they aren't yet taking the medicine, should I recommend they do so?" Clearly, both can be important questions, but as the above discussion shows, even though both can be written as linear models, the superficial similarities hide important interpretational distinctions, which can only be elucidated by writing down more explicitly your model of the world.
Best Answer
You are not far from the correct interpretation. Presentations are often ambiguous and/or contradictory on this point, but the problems resolve themselves if we keep the CEF and the causal model clearly separated, as you do.
More precisely, mean independence holds exactly if the exact CEF is linear; if you rely on the approximation argument, even this condition holds only approximately. For simplicity, let us assume that the exact CEF is linear.
This part is fine, but a clarification about the word "coincide" can help. Even if the two equations can coincide algebraically, we must keep in mind that they are two very different concepts; they must not be mixed up.
Exactly.
Yes. In this situation the CEF above does not permit us to identify the causal parameters.
Indeed, you have that $E[\hat{\beta}] = \beta$ regardless of endogeneity problems.
Endogeneity implies that $\hat{\beta}$ is biased, but biased for what? It is biased for $\delta$.
Indeed.
Exactly.
The IV estimator can deliver consistent estimation of $\delta$. However, even OLS can still be used; what you need are good controls.
Finally, this topic is quite vast; this explanation of mine may help: Under which assumptions a regression can be interpreted causally?