Solved – Conditional mean independence implies unbiasedness and consistency of the OLS estimator

econometrics, least squares, multiple regression, nonlinear regression, regression

Consider the following multiple regression model: $$Y=X\beta+Z\delta+U.\tag{1}$$

Here $Y$ is an $n\times 1$ column vector; $X$ an $n\times (k+1)$ matrix; $\beta$ a $(k+1)\times 1$ column vector; $Z$ an $n\times l$ matrix; $\delta$ an $l\times 1$ column vector; and $U$, the error term, an $n\times1$ column vector.


QUESTION

My lecturer, the textbook Introduction to Econometrics, 3rd ed. by James H. Stock and Mark W. Watson (p. 281), and Econometrics: Honor's Exam Review Session (PDF, p. 7) have all expressed the following to me.

  1. If we assume what is called conditional mean independence, which by
    definition means that $$E(U|X,Z)=E(U|Z),\tag{2}$$
  2. and if the least squares assumptions are satisfied, except for the conditional mean zero assumption $E(U|X,Z)=0$ (so we assume $E(U|X,Z)=E(U|Z) \neq 0$; see assumptions 1–3 below),

  3. then the OLS estimator $\hat{\beta}$ of $\beta$ in $(1)$ remains unbiased and consistent under this weaker set of assumptions.

How do I prove this proposition, i.e., that 1 and 2 above imply that the OLS estimate of $\beta$ is an unbiased and consistent estimator of $\beta$? Is there any research article proving this proposition?


COMMENT

The simplest case is given by considering the linear regression model $$Y_i=\beta_0+\beta_1X_i+\beta_2Z_i+u_i,\quad i=1,2,\ldots,n,$$ and proving that the OLS estimate $\hat{\beta}_1$ of $\beta_1$ is unbiased if $E(u_i|X_i,Z_i)=E(u_i|Z_i)$ for each $i$.

PROOF OF UNBIASEDNESS ASSUMING THAT $U_i$ AND $Z_i$ ARE JOINTLY NORMALLY DISTRIBUTED

Define $V=U-E(U|X,Z)$; then $U=V+E(U|X,Z)$ and $$E(V|X,Z)=0.\tag{*}$$ Thus $(1)$ may be rewritten as $$Y=X\beta+Z\delta+E(U|X,Z)+V.\tag{3}$$ By $(2)$ it then follows that $$Y=X\beta+Z\delta+E(U|Z)+V.\tag{4}$$ Now, since $U_i$ and $Z_i$ are jointly normally distributed, the theory of normal distributions (cf. Deriving the conditional distributions of a multivariate normal distribution) says that $$E(U|Z)=Z\gamma\tag{**}$$ for some $l\times 1$ vector $\gamma\neq\textbf{0}$ (indeed, we do not need to assume joint normality, only this identity).
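Spelled out (assuming, for simplicity, that $U_i$ and $Z_i$ have zero means; nonzero means only add an intercept term that can be absorbed into $X$): for jointly normal $(U_i,Z_i)$, $$E(U_i|Z_i)=Z_i\Sigma_{ZZ}^{-1}\Sigma_{ZU}=:Z_i\gamma,$$ where $Z_i$ is the $i$th row of $Z$, $\Sigma_{ZZ}=\text{Var}(Z_i^T)$ and $\Sigma_{ZU}=\text{Cov}(Z_i^T,U_i)$; stacking over $i=1,\ldots,n$ then gives $(**)$.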

Now $(4)$ becomes $$Y=X\beta+Z(\delta+\gamma)+V.\tag{5}$$ For the model $(5)$ all the least squares assumptions are satisfied, as the error term $V$ satisfies the assumption of conditional mean zero. This implies that the OLS estimate $\hat{\beta}$ of $\beta$ will be unbiased, for if we let $\rho=\delta+\gamma$, and let $W=(X,Z)$ be the $n\times((k+1)+l)$ matrix composed of $X$ and $Z$, then the OLS estimate of $(\beta,\rho)$ in $(5)$ is obtained from the following: \begin{align}(\hat{\beta}^T,\hat{\rho}^T)^T &=(W^TW)^{-1}W^TY\\
&=(W^TW)^{-1}W^T(W(\beta^T,\rho^T)^T+V)\\
&=(\beta^T,\rho^T)^T+(W^TW)^{-1}W^TV\end{align}

and thus \begin{align}E((\hat{\beta}^T,\hat{\rho}^T)^T|W)&=(\beta^T,\rho^T)^T+(W^TW)^{-1}W^TE(V|W)\\&=(\beta^T,\rho^T)^T+(W^TW)^{-1}W^T\textbf{0}\\&=(\beta^T,\rho^T)^T,\end{align} where the second line follows by $(*)$. Thus $\hat{\beta}$ is a conditionally unbiased estimate of $\beta$, since the OLS estimate given for model $(1)$ coincides with that given for model $(5)$. Now, by the law of total expectation, \begin{align}E(\hat{\beta})&=E(E(\hat{\beta}|W))\\ &=E(\beta)\\ &=\beta,\end{align} and thus $\hat{\beta}$ is an unbiased estimator of $\beta$.

(One may note that $E(\hat{\rho})=\rho=\delta+\gamma\neq\delta$, so the OLS coefficient on $Z$ is not an unbiased estimator of $\delta$.)

However, the special case above assumes that $U_i$ and $Z_i$ are jointly normally distributed. How do I prove the proposition without this assumption?

Of course, assuming that $E(U|Z)=Z\gamma$ always suffices (cf. $(**)$), but I am supposed to derive the result using only $(2)$ and the least squares assumptions, excluding the Conditional Mean Zero assumption (see below).
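(To illustrate the linear case $(**)$ numerically, here is a minimal simulation sketch of my own, with arbitrary values $\beta=1$, $\delta=1$, $\gamma=2$ and standard normal errors: the OLS coefficient on $X$ is centred on $\beta$, while the coefficient on $Z$ is centred on $\delta+\gamma=3$ rather than on $\delta$.)

set.seed(1)
n <- 100000
z <- rnorm(n)
x <- z + rnorm(n)        # X correlated with Z (and hence with U)
u <- 2*z + rnorm(n)      # E(U|X,Z) = E(U|Z) = 2*Z, so gamma = 2 and E(U|X,Z) != 0
y <- 1*x + 1*z + u       # beta = 1, delta = 1

summary(lm(y ~ x + z))   # coefficient on x approx. 1 (unbiased); coefficient on z approx. 3 = delta + gamma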

REGARDING CONSISTENCY

I think one can also see that the estimate $\hat{\beta}$ is consistent for $\beta$ by noticing that in the regression model $(5)$ all the least squares assumptions are satisfied, including the assumption that the (new) error term $V$ satisfies the Conditional Mean Zero assumption (cf. $(*)$ and see below).
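For completeness, here is a sketch of that consistency argument for model $(5)$, assuming $(**)$ and, in addition to assumptions 1–3 below, that $E(W_i^TW_i)$ is finite and invertible, where $W_i$ denotes the $i$th row of $W$: from the display above, $$(\hat{\beta}^T,\hat{\rho}^T)^T=(\beta^T,\rho^T)^T+\Big(\tfrac{1}{n}W^TW\Big)^{-1}\tfrac{1}{n}W^TV,$$ and by the law of large numbers (using assumptions 1 and 2) $\tfrac{1}{n}W^TW=\tfrac{1}{n}\sum_i W_i^TW_i\xrightarrow{p}E(W_i^TW_i)$, while $\tfrac{1}{n}W^TV=\tfrac{1}{n}\sum_i W_i^TV_i\xrightarrow{p}E(W_i^TV_i)=E\big(W_i^TE(V_i|W_i)\big)=\textbf{0}$ by $(*)$ and the law of total expectation, so that $(\hat{\beta}^T,\hat{\rho}^T)^T\xrightarrow{p}(\beta^T,\rho^T)^T$.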

I may add a proof of consistency later on, based on a series of exercises in Introduction to Econometrics, 3rd ed. by James H. Stock and Mark W. Watson, ch. 18; that proof is, however, quite long. The point here is that the proof provided in those exercises assumes $(**)$, so I am still wondering whether the assumption $(2)$ alone really suffices.

SUBQUERY 1

In Introduction to Econometrics, 3rd ed. by James H. Stock and Mark W. Watson, it is said, at p. 300, that the assumption $(**)$ can be "relaxed" using the theory of nonlinear regression. What do or may they mean by this?

THE LEAST SQUARES ASSUMPTIONS

Here I exclude the conditional mean zero assumption that $E(U|X,Z)=0$ since the proposition which we try to prove here allows for cases where $E(U|X,Z)\neq 0$. These are e.g. cases when $Z$ is correlated with $U$. Cf. Econometrics: Honor's Exam Review Session (PDF), p. 7.

The least squares assumptions are the following.

  1. The joint distributions of $(Y_i,X_i,Z_i)$, $i=1,2,\ldots,n,$ are i.i.d., where $Y_i$ is the $i$th element of $Y$ and where $X_i$ and $Z_i$ are the $i$th row vectors of $X$ and $Z$.

  2. Large outliers are unlikely, i.e., for each $i$, $X_i$, $Z_i$ and $U_i$ have finite fourth moments, where $U_i$ is the $i$th element of $U$.

  3. $(X,Z)$ has full column rank (i.e., there is no perfect multicollinearity; this ensures the invertibility of $W^TW$).

  4. (Extended least squares assumptions: while I do not think this is necessary, and it has been said to me that it is not, we may also assume homoskedasticity, i.e., $\text{Var}(U_i|X_i,Z_i)=\sigma^2_U$ for each $i$, and that the conditional distribution of $U_i$ given $(X_i,Z_i)$ is normal for each $i$, i.e., normal errors.)

NOTE ON TERMINOLOGY

In $(1)$, the Conditional Mean Zero assumption is the assumption that $E(U|X,Z)=0$. The Conditional Mean Independence assumption, however, is the assumption that $E(U|X,Z)=E(U|Z)$.

This terminology is used in e.g. Introduction to Econometrics, 3rd ed. by James H. Stock and Mark W. Watson, p. 281; and Econometric Analysis of Cross Section and Panel Data, 1st ed. by Jeffrey M. Wooldridge, p. 607. See also Conditional Independence Restrictions: Testing and Estimation for similar discussions.

ADDITIONAL THOUGHTS AND SUBQUERY 2

Contrary to James H. Stock and Mark W. Watson, I think that conditional mean independence does not ensure an unbiased OLS estimate of $\beta$. This is because $E(U|Z)$ may take nonlinear forms such as $E(U|Z)=p(Z)$, where $p(Z)$ is a polynomial in $Z$, or $E(U|Z)=\exp(Z\gamma)$, where $\gamma$ is some parameter yet to be estimated (here I am using the matrix exponential); then, I think, nonlinear regression has to be applied, which generally leaves us with biased estimates. Also, the OLS estimate of $\beta$ in $(1)$ may not even coincide with the OLS estimate of $\beta$ in $(4)$ if $E(U|Z)$ takes certain nonlinear forms. (Psychologically, I also feel that the statement made in the book by Stock & Watson is too good to be true.)

Thus, an additional question is whether there is some counterexample to the proposition that conditional mean independence leads to an unbiased OLS estimate.

SUBQUERY 3

In Mostly Harmless Econometrics, Angrist & Pischke argue in subsection 3.3 (pp. 68–91) that under conditional independence (CI), i.e., $Y$ being independent of $X$ given $W$ (which, I guess, is a stronger condition than the conditional mean independence assumption given above), there is a tight connection between matching estimates of the effect of $X$ on $Y$ and the coefficients on $X$ in the regression of $Y$ on $X$ and $W$. This motivates the claim that under CI the OLS estimate of the coefficient on $X$ in $(1)$ is less biased than if CI does not hold (all else equal).

Now, can this idea somehow be used to answer my main question here?

Best Answer

It's false. As you observe, if you read Stock and Watson closely, they don't actually endorse the claim that OLS is unbiased for $\beta$ under conditional mean independence. They endorse the much weaker claim that OLS is unbiased for $\beta$ if $E(u|x,z)=z\gamma$. Then, they say something vague about non-linear least squares.

Your equation (4) contains what you need to see that the claim is false. Estimating equation (4) by OLS while omitting the variable $E(u|z)$ leads to omitted variables bias. As you probably recall, the bias term from an omitted variable (when the omitted variable has a coefficient of 1) is controlled by the coefficients from the following auxiliary regression: \begin{align} E(u|z) = x\alpha_1 + z\alpha_2 + \nu \end{align} The bias in the original regression for $\beta$ is $\alpha_1$ from this regression, and the bias on $\gamma$ is $\alpha_2$. If $x$ is correlated with $E(u|z)$, after controlling linearly for $z$, then $\alpha_1$ will be non-zero and the OLS coefficient will be biased.
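To spell out where that bias formula comes from, in the notation of your own proof: write $w=(x,z)$ and estimate $(4)$ by OLS while leaving out $E(u|z)$. Then \begin{align}(\hat\beta^T,\hat\gamma^T)^T &=(w^Tw)^{-1}w^Ty\\ &=(\beta^T,\gamma^T)^T+(w^Tw)^{-1}w^TE(u|z)+(w^Tw)^{-1}w^Tv,\end{align} and since $E(v|w)=0$, $$E\big((\hat\beta^T,\hat\gamma^T)^T\,\big|\,w\big)=(\beta^T,\gamma^T)^T+(w^Tw)^{-1}w^TE(u|z).$$ But $(w^Tw)^{-1}w^TE(u|z)$ is exactly the coefficient vector $(\alpha_1^T,\alpha_2^T)^T$ (in this sample) from the auxiliary regression of $E(u|z)$ on $x$ and $z$, so the bias on $\hat\beta$ is $\alpha_1$, and it vanishes only when $x$ is uncorrelated with $E(u|z)$ after controlling linearly for $z$.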

Here is an example to prove the point: \begin{align} \xi &\sim F(), \; \zeta \sim G(), \; \nu \sim H()\quad \text{all independent}\\ z &=\xi\\ x &= z^2 + \zeta\\ u &= z+z^2-E(z+z^2)+\nu \end{align}

Looking at the formula for $u$, it is clear that $E(u|x,z)=E(u|z)=z+z^2-E(z+z^2)$. Looking at the auxiliary regression, it is clear that (absent some fortuitous choice of $F,G,H$) $\alpha_1$ will not be zero.

Here is a very simple example in R which demonstrates the point:

set.seed(12344321)
z <- runif(n=100000,min=0,max=10)
x <- z^2 + runif(n=100000,min=0,max=20)                    # x depends on z only through z^2
u <- z + z^2 - mean(z+z^2) + rnorm(n=100000,mean=0,sd=20)  # E(u|x,z) = E(u|z) = z + z^2 - E(z+z^2)
y <- x + z + u                                             # true coefficient on x is 1

summary(lm(y~x+z))       # coefficient on x is biased upward (by about 0.63)

# auxiliary regression: regress the omitted variable E(u|z) (up to a constant) on x and z
summary(lm(z+z^2~x+z))   # coefficient on x estimates the bias on the coefficient on x above

Notice that the first regression gives you a coefficient on $x$ which is biased up by 0.63, reflecting the fact that $x$ "has some $z^2$ in it" as does $E(u|z)$. Notice also that the auxiliary regression gives you a bias estimate of about 0.63.

So, what are Stock and Watson (and your lecturer) talking about? Let's go back to your equation (4): \begin{align} y = x\beta + z\gamma + E(u|z) + v \end{align}

It's an important fact that the omitted variable is only a function of $z$. It seems like if we could control for $z$ really well, that would be enough to purge the bias from the regression, even though $x$ may be correlated with $u$.

Suppose we estimated the equation below using either a non-parametric method to estimate the function $f()$ or using the correct functional form $f(z)=z\gamma+E(u|z)$. If we were using the correct functional form, we would be estimating it by non-linear least squares (explaining the cryptic comment about NLS): \begin{align} y = x\beta + f(z) + v \end{align} That would give us a consistent estimator for $\beta$ because there is no longer an omitted variable problem.
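For instance, continuing the R simulation above (just a sketch): in that DGP $f(z)=2z+z^2-E(z+z^2)$ is quadratic in $z$, so adding a squared term in $z$ already controls for $z$ well enough, and a flexible spline in $z$ does roughly the same job without using the exact functional form:

# control for z flexibly; in the simulated DGP, f(z) is quadratic in z
summary(lm(y ~ x + z + I(z^2)))        # coefficient on x is now close to the true value of 1

# or, without assuming the functional form, use a spline in z
library(splines)
summary(lm(y ~ x + ns(z, df = 10)))    # coefficient on x again close to 1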

Alternatively, if we had enough data, we could go "all the way" in controlling for $z$. We could look at a subset of the data where $z=1$, and just run the regression: \begin{align} y = x\beta + v \end{align} This would give unbiased, consistent estimators of the elements of $\beta$, except for the intercept, of course, which would be polluted by $f(1)$. Obviously, you could also get a (different) consistent, unbiased estimator by running that regression only on data points for which $z=2$. And another one for the points where $z=3$. Etc. Then you'd have a bunch of good estimators, from which you could make a great estimator by, say, averaging them all together somehow.

This latter thought is the inspiration for matching estimators. Since we don't usually have enough data to literally run the regression only for $z=1$, or even for pairs of points where $z$ is identical, we instead run the regression for points where $z$ is "close enough" to being identical.
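As a crude illustration of that idea, continuing the simulation above (only a sketch, not a serious matching estimator): cut $z$ into narrow bins, regress $y$ on $x$ within each bin, and average the within-bin slopes:

# crude subclassification on z: within-bin regressions of y on x, averaged over bins
bins   <- cut(z, breaks = 100)
slopes <- sapply(split(data.frame(y, x), bins),
                 function(d) coef(lm(y ~ x, data = d))["x"])
mean(slopes)             # close to the true value of 1, unlike the biased full-sample estimate above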