Regression – Can the Zero Conditional Mean Assumption Ever Fail?

endogeneity · least squares · regression

I have a question about the so-called "zero conditional mean" assumption often made in the context of regression analysis. I am struggling to see how it could be violated, or rather where it is violated.

$E[\hat \beta] = E[\left(\mathbf X' \mathbf X\right)^{-1}\mathbf X' \mathbf Y] = \beta + E[\left(\mathbf X' \mathbf X\right)^{-1}\mathbf X' \boldsymbol \varepsilon]$

Above, I give the expectation of the OLS estimator in matrix form. My problem is that I do not see how it is possible for $E[X'\varepsilon]$ to not equal zero. Before you all give me examples of this bias from the endogeneity literature: the reason I have trouble with this is that whenever we formulate our population regression function as a conditional expectation, $E[X'\varepsilon] = 0$ by definition. In other words, in any population regression function defined as a conditional expectation, the errors are uncorrelated with our regressors, because their correlation is zero by construction. To put it formally (and by the Law of Iterated Expectations):

$E[Y|X]=B_{0}+B_{1}X_{i} \;\Rightarrow\; Y_{i}=E[Y|X]+\varepsilon_{i} \;\Rightarrow\; E[Y|X]=E[Y|X]+E[\varepsilon_{i}|X] \;\Rightarrow\; E[\varepsilon_{i}|X]=0$

But this raises the question: what is the correlation term in the expectation of the OLS estimator referring to, and how can it be that the assumption does not hold?

EDIT: Some of the comments have suggested that I am confusing errors with residuals. However, it is not obvious to me that this is the case. My point is as follows. When running a regression, we use OLS to tell us about $E[Y|X]$. However, for any way in which I specify $E[Y|X]$, it seems as though I am mathematically committing myself to $E[\varepsilon_{i}|X]=0$. Could someone perhaps specify a population regression function (as a conditional expectation) where this is not the case?

For example, suppose we consider $E[Y|X] = B_{0}+B_{1}X_{i}+B_{2}X_{i}^{2}$ and that $E[\varepsilon_{i}|X] = g(X_i)$. Then, by iterated expectations, we get that:

$Y_{i}=B_{0}+B_{1}X_{i}+B_{2}X_{i}^{2}+\varepsilon_{i}$

$E[Y|X]=E[B_{0}+B_{1}X_{i}+B_{2}X_{i}^{2} \mid X] + E[\varepsilon_{i}|X]$

$E[Y|X]=E[Y|X] + E[\varepsilon_{i}|X]$

$E[Y|X]=E[Y|X] + g(X_i)$

$g(X_i)=0$

If $g(X_i)$ does not equal zero, then we have a mathematical contradiction, and therefore I have no idea how, in the OLS estimates, we can have anything other than $E[X'\varepsilon]=0$.

EDIT 2:

It seems I have found a potential solution to my problem:

CEF: $E[Y|X]=B_{0}+B_{1}X_{i}+B_{2}X_{i}^{2}$

Regression: $Y_{i}= B_{0}+B_{1}X_{i}+B_{2}X_{i}^{2} + \varepsilon_{i}$

The idea is that $E[\varepsilon_{i} \mid X]=0$ must hold for the CEF, but it may not hold for the regression function. Is this on the right track?

Best Answer

The short answer is yes: $\mathbb{E}[X_i\varepsilon_i] = 0$ holds by construction.

A few important remarks:

  • However, this does not imply that $\mathbb{E}[\varepsilon_i \mid X_i] = 0$.
  • A zero conditional mean is a much stronger condition.
  • Your logic fails because you are assuming that conditional means are always linear.
  • What about endogeneity?
    • Statistically, the OLS $\beta$ always satisfies $\mathbb{E}[X_i\varepsilon_i] = 0$.
    • Modern theory calls this a "pseudo-parameter" that "mechanically" satisfies the restriction.
    • There are many examples where we can formally define an external notion of a "true" $\beta^*$ and show that things like sample selection, omitted variables, etc., can lead to situations where $\beta \ne \beta^*$. This requires more structure on the problem to even define the bias (see the sketch after this list).
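
To illustrate that last bullet, here is a minimal simulation sketch of an omitted-variable setup; the data-generating process, variable names, and numbers are my own illustrative assumptions, not anything from the question. The OLS residuals are orthogonal to the included regressor by construction, yet the slope differs from the structural $\beta^*$:

```python
import numpy as np

# Illustrative omitted-variable setup (assumed for this sketch, not from the post):
# structural model  y = beta_star * x + gamma * z + u,  with z correlated with x.
rng = np.random.default_rng(0)
n = 200_000
beta_star, gamma = 1.0, 2.0

z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)      # x is correlated with the omitted variable z
u = rng.normal(size=n)
y = beta_star * x + gamma * z + u

# OLS of y on (1, x) alone: the "pseudo-parameter" beta
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
eps_hat = y - X @ beta_hat

print(X.T @ eps_hat / n)              # ~ [0, 0]: orthogonality holds mechanically
print(beta_hat[1], "vs structural", beta_star)   # the slope is biased away from beta_star
```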

The way that you are looking at the problem is slightly circular. Let us break down the problem by looking at some definitions.

Setting up the problem

The OLS estimator in matrix form is $\hat{\beta} = (X'X)^{-1}X'Y$. It can also be written as follows: $$ \hat{\beta} = \left( \frac{1}{n} \sum_{i=1}^n X_i X_i'\right)^{-1}\left( \frac{1}{n} \sum_{i=1}^n X_i Y_i \right) $$ The OLS estimator is the sample analogue of the following population quantity: $$ \beta = \mathbb{E}[X_iX_i']^{-1}\mathbb{E}[X_iY_i] $$ This definition is convenient because it doesn't introduce any assumptions about the error terms. It just says that $\beta$ is a function of two population means.
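
As a quick numerical sanity check (a sketch with simulated data; the sizes and coefficients are arbitrary assumptions of mine), the matrix form and the averaged form of $\hat{\beta}$ give identical numbers, since the second is just the first with both factors divided by $n$:

```python
import numpy as np

# Simulated data only (sizes and coefficients are arbitrary).
rng = np.random.default_rng(1)
n, k = 10_000, 3
X = rng.normal(size=(n, k))
Y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=n)

# Matrix form:  beta_hat = (X'X)^{-1} X'Y
beta_matrix = np.linalg.solve(X.T @ X, X.T @ Y)

# Averaged form:  (1/n sum x_i x_i')^{-1} (1/n sum x_i y_i)
Sxx = (X.T @ X) / n
Sxy = (X.T @ Y) / n
beta_avg = np.linalg.solve(Sxx, Sxy)

print(np.allclose(beta_matrix, beta_avg))   # True: the two forms coincide
```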

Why is $\mathbb{E}[X_i\varepsilon_i] = 0$ by construction?

Now let us define the error terms as $$ \varepsilon_i = Y_i - X_i'\beta $$ Then we can verify that \begin{align*} \mathbb{E}[X_i\varepsilon_i] &= \mathbb{E}[X_iY_i-X_iX_i' \beta] \\ &=\mathbb{E}[X_iY_i] - \mathbb{E}[X_iX_i'] \beta \end{align*} Plug in the definition of $\beta$ above and you can easily check that $\mathbb{E}[X_i\varepsilon_i] = 0$. A similar result holds in finite samples for $\hat{\beta}$. In particular, $$ \frac{1}{n}\sum_{i=1}^n X_i(Y_i - X_i'\hat{\beta}) = \left(\frac{1}{n}\sum_{i=1}^n X_iY_i\right) - \left(\frac{1}{n}\sum_{i=1}^n X_iX_i' \right)\hat{\beta} = 0$$ This can also be written in matrix form as $\hat{\varepsilon} = Y - X\hat{\beta}$, with $\frac{X'\hat{\varepsilon}}{n} = 0$.
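
Because this orthogonality is purely mechanical, a short sketch can confirm it with data generated by no particular model (all names and numbers below are illustrative assumptions):

```python
import numpy as np

# Any data will do: no model for Y is assumed anywhere in this sketch.
rng = np.random.default_rng(2)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = rng.exponential(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
eps_hat = Y - X @ beta_hat

# X' eps_hat / n is zero up to floating-point error, whatever the data are.
print(X.T @ eps_hat / n)
```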

Why might $\mathbb{E}[\varepsilon_i \mid X_i]$ not equal zero?

So far we have not made any assumptions about the dependence between $(Y_i,X_i)$. Now consider a general model $$Y_i = g(X_i) + U_i, \qquad \mathbb{E}[U_i \mid X_i] = 0$$ where $U_i$ is a new variable that we are introducing and $g(X_i)$ is the conditional mean of $Y_i$. Now let us apply the definition of the OLS error to this model: \begin{align*} \varepsilon_i &= Y_i - X_i'\beta \\ &= g(X_i) + U_i - X_i'\beta \\ &= [g(X_i) - X_i'\beta] + U_i \end{align*} Now let us compute the conditional expectation: \begin{align*} \mathbb{E}[\varepsilon_i \mid X_i] &= \mathbb{E}[g(X_i) - X_i'\beta \mid X_i] + \mathbb{E}[U_i \mid X_i] \\ &= g(X_i) - X_i'\beta \end{align*} In the second line we use the facts that the conditional mean of $U_i$ is zero and that $g(X_i) - X_i'\beta$ is a function of $X_i$. Once you state it in this form, it is easy to find counterexamples in which the conditional mean is non-zero, regardless of the value of $\beta$. For instance, take $g(X_i) = X_i^2$ and $X_i \sim \mathcal{N}(0,1)$.
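
To make this counterexample concrete, here is a small simulation sketch (the sample size, seed, and crude binning are my own choices): with $g(X_i) = X_i^2$ and $X_i \sim \mathcal{N}(0,1)$, the best linear predictor of $Y_i$ given $(1, X_i)$ is the constant $1$, so $\mathbb{E}[\varepsilon_i \mid X_i] = X_i^2 - 1 \neq 0$ even though $\mathbb{E}[X_i\varepsilon_i] = 0$:

```python
import numpy as np

# g(X) = X^2 with X ~ N(0, 1); sample size and seed are arbitrary.
rng = np.random.default_rng(3)
n = 500_000
x = rng.normal(size=n)
u = rng.normal(size=n)
y = x**2 + u                          # the CEF E[Y | X] = X^2 is nonlinear

X = np.column_stack([np.ones(n), x])  # linear regression of y on (1, x)
beta = np.linalg.solve(X.T @ X, X.T @ y)
eps = y - X @ beta

print(beta)                           # ~ [1, 0]: the best linear predictor is the constant 1
print(X.T @ eps / n)                  # ~ [0, 0]: unconditional orthogonality still holds
# But the conditional mean of eps is X^2 - 1, which is not zero. A crude check:
bins = np.digitize(x, [-1.0, 1.0])    # condition on x < -1, -1 <= x < 1, x >= 1
print([eps[bins == b].mean() for b in range(3)])   # clearly non-zero within bins
```

The within-bin means are only a rough stand-in for $\mathbb{E}[\varepsilon_i \mid X_i]$; any nonparametric smoother would trace out the same $X_i^2 - 1$ shape.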

  • In this particular case, we can show that the OLS coefficient satisfies $\beta = \mathbb{E}[X_iX_i']^{-1}\mathbb{E}[X_ig(X_i)]$, by applying the law of iterated expectations (checked numerically in the sketch after this list).
  • As you can see, everything is internally consistent.
  • In a nonlinear model like this, $X_i'\beta$ is called the "best linear projection" of $Y_i$, because it minimizes the expected squared residual among all linear functions of $X_i$.
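
Finally, a quick Monte Carlo check of the first bullet, using the same illustrative $g(X_i) = X_i^2$ example as above: computing $\beta$ from the moment formula, with no error term in sight, reproduces the coefficients obtained by regressing $Y_i$ on $(1, X_i)$.

```python
import numpy as np

# Monte Carlo check of beta = E[X X']^{-1} E[X g(X)] for g(X) = X^2, X ~ N(0, 1).
rng = np.random.default_rng(4)
n = 500_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
g = x**2                              # the conditional mean E[Y | X] from the example above

# No error term appears anywhere: beta is defined from g and the distribution of X alone.
beta_from_g = np.linalg.solve(X.T @ X / n, X.T @ g / n)
print(beta_from_g)                    # ~ [1, 0], the same best linear predictor as before
```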