Solved – Conditional mean independence implies unbiasedness and consistency of the OLS estimator

econometrics, least squares, multiple regression, nonlinear regression, regression

Consider the following multiple regression model: $$Y=X\beta+Z\delta+U.\tag{1}$$

Here $Y$ is an $n\times 1$ column vector; $X$ an $n\times (k+1)$ matrix; $\beta$ a $(k+1)\times 1$ column vector; $Z$ an $n\times l$ matrix; $\delta$ an $l\times 1$ column vector; and $U$, the error term, an $n\times1$ column vector.


QUESTION

My lecturer, the textbook Introduction to Econometrics, 3rd ed. by James H. Stock and Mark W. Watson (p. 281), and Econometrics: Honor's Exam Review Session (PDF, p. 7) have all expressed the following to me.

  1. If we assume what is called conditional mean independence, which by
    definition means that $$E(U|X,Z)=E(U|Z),\tag{2}$$
  2. and if the least squares assumptions are satisfied, except for the conditional mean zero assumption $E(U|X,Z)=0$ (so we assume $E(U|X,Z)=E(U|Z) \neq 0$; see assumptions 1–3 below),

  3. then the OLS estimator $\hat{\beta}$ of $\beta$ in $(1)$ remains unbiased and consistent under this weaker set of assumptions.

How do I prove this proposition, i.e., that 1 and 2 above imply that the OLS estimate of $\beta$ is an unbiased and consistent estimator of $\beta$? Is there any research article proving this proposition?


COMMENT

The simplest case is given by considering the linear regression model $$Y_i=\beta_0+\beta_1X_i+\beta_2Z_i+u_i,\quad i=1,2,\ldots,n,$$ and proving that the OLS estimate $\hat{\beta}_1$ of $\beta_1$ is unbiased if $E(u_i|X_i,Z_i)=E(u_i|Z_i)$ for each $i$.

PROOF OF UNBIASEDNESS ASSUMING THAT $U_i$ AND $Z_i$ ARE JOINTLY NORMALLY DISTRIBUTED

Define $V=U-E(U|X,Z)$; then $U=V+E(U|X,Z)$ and $$E(V|X,Z)=0.\tag{*}$$ Thus $(1)$ may be rewritten as $$Y=X\beta+Z\delta+E(U|X,Z)+V.\tag{3}$$ By $(2)$ it then follows that $$Y=X\beta+Z\delta+E(U|Z)+V.\tag{4}$$ Now, since $U_i$ and $Z_i$ are jointly normally distributed, the theory of normal distributions (cf. Deriving the conditional distributions of a multivariate normal distribution) says that $$E(U|Z)=Z\gamma\tag{**}$$ for some $l\times 1$ vector $\gamma\neq\textbf{0}$ (indeed, we do not need to assume joint normality, only this identity).
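Spelled out (assuming, for simplicity, that $U_i$ and $Z_i$ have zero means; nonzero means only add an intercept term that can be absorbed into $X$): for jointly normal $(U_i,Z_i)$, $$E(U_i|Z_i)=Z_i\Sigma_{ZZ}^{-1}\Sigma_{ZU}=:Z_i\gamma,$$ where $Z_i$ is the $i$th row of $Z$, $\Sigma_{ZZ}=\text{Var}(Z_i^T)$ and $\Sigma_{ZU}=\text{Cov}(Z_i^T,U_i)$; stacking over $i=1,\ldots,n$ then gives $(**)$.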

Now $(4)$ becomes $$Y=X\beta+Z(\delta+\gamma)+V.\tag{5}$$ For the model $(5)$ all the least squares assumptions are satisfied, as the error term $V$ satisfies the assumption of conditional mean zero. This implies that the OLS estimate $\hat{\beta}$ of $\beta$ will be unbiased, for if we let $\rho=\delta+\gamma$, and let $W=(X,Z)$ be the $n\times((k+1)+l)$ matrix composed of $X$ and $Z$, then the OLS estimate of $(\beta,\rho)$ in $(5)$ is obtained from the following: \begin{align}(\hat{\beta}^T,\hat{\rho}^T)^T &=(W^TW)^{-1}W^TY\\
&=(W^TW)^{-1}W^T(W(\beta^T,\rho^T)^T+V)\\
&=(\beta^T,\rho^T)^T+(W^TW)^{-1}W^TV\end{align}

and thus \begin{align}E((\hat{\beta}^T,\hat{\rho}^T)^T|W)&=(\beta^T,\rho^T)^T+(W^TW)^{-1}W^TE(V|W)\\&=(\beta^T,\rho^T)^T+(W^TW)^{-1}W^T\textbf{0}\\&=(\beta^T,\rho^T)^T,\end{align} where the second line follows by $(*)$. Thus $\hat{\beta}$ is a conditionally unbiased estimate of $\beta$, since the OLS estimate given for model $(1)$ coincides with that given for model $(5)$. Now, by the law of total expectation, \begin{align}E(\hat{\beta})&=E(E(\hat{\beta}|W))\\ &=E(\beta)\\ &=\beta,\end{align} and thus $\hat{\beta}$ is an unbiased estimator of $\beta$.

(One may note that $E(\hat{\rho})=\rho=\delta+\gamma\neq\delta$, so the OLS coefficient on $Z$ is not an unbiased estimator of $\delta$.)

However, the special case above assumes that $U_i$ and $Z_i$ are jointly normally distributed. How do I prove the proposition without this assumption?

Of course, assuming that $E(U|Z)=Z\gamma$ always suffices (cf. $(**)$), but I am supposed to derive the result using only $(2)$ and the least squares assumptions, excluding the Conditional Mean Zero assumption (see below).
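(To illustrate the linear case $(**)$ numerically, here is a minimal simulation sketch of my own, with arbitrary values $\beta=1$, $\delta=1$, $\gamma=2$ and standard normal errors: the OLS coefficient on $X$ is centred on $\beta$, while the coefficient on $Z$ is centred on $\delta+\gamma=3$ rather than on $\delta$.)

set.seed(1)
n <- 100000
z <- rnorm(n)
x <- z + rnorm(n)        # X correlated with Z (and hence with U)
u <- 2*z + rnorm(n)      # E(U|X,Z) = E(U|Z) = 2*Z, so gamma = 2 and E(U|X,Z) != 0
y <- 1*x + 1*z + u       # beta = 1, delta = 1

summary(lm(y ~ x + z))   # coefficient on x approx. 1 (unbiased); coefficient on z approx. 3 = delta + gamma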

REGARDING CONSISTENCY

I think one can also see that the estimate $\hat{\beta}$ is consistent for $\beta$ by noticing that in the regression model $(5)$ all the least squares assumptions are satisfied, including the assumption that the (new) error term $V$ satisfies the Conditional Mean Zero assumption (cf. $(*)$ and see below).
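For completeness, here is a sketch of that consistency argument for model $(5)$, assuming $(**)$ and, in addition to assumptions 1–3 below, that $E(W_i^TW_i)$ is finite and invertible, where $W_i$ denotes the $i$th row of $W$: from the display above, $$(\hat{\beta}^T,\hat{\rho}^T)^T=(\beta^T,\rho^T)^T+\Big(\tfrac{1}{n}W^TW\Big)^{-1}\tfrac{1}{n}W^TV,$$ and by the law of large numbers (using assumptions 1 and 2) $\tfrac{1}{n}W^TW=\tfrac{1}{n}\sum_i W_i^TW_i\xrightarrow{p}E(W_i^TW_i)$, while $\tfrac{1}{n}W^TV=\tfrac{1}{n}\sum_i W_i^TV_i\xrightarrow{p}E(W_i^TV_i)=E\big(W_i^TE(V_i|W_i)\big)=\textbf{0}$ by $(*)$ and the law of total expectation, so that $(\hat{\beta}^T,\hat{\rho}^T)^T\xrightarrow{p}(\beta^T,\rho^T)^T$.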

I may add a proof of consistency later on, based on a series of exercises in Introduction to Econometrics, 3rd ed. by James H. Stock and Mark W. Watson, ch. 18; that proof is, however, quite long. The point here is that the proof provided in those exercises assumes $(**)$, so I am still wondering whether the assumption $(2)$ alone really suffices.

SUBQUERY 1

In Introduction to Econometrics, 3rd ed. by James H. Stock and Mark W. Watson, it is said, at p. 300, that the assumption $(**)$ can be "relaxed" using the theory of nonlinear regression. What do or may they mean by this?

THE LEAST SQUARES ASSUMPTIONS

Here I exclude the conditional mean zero assumption that $E(U|X,Z)=0$ since the proposition which we try to prove here allows for cases where $E(U|X,Z)\neq 0$. These are e.g. cases when $Z$ is correlated with $U$. Cf. Econometrics: Honor's Exam Review Session (PDF), p. 7.

The least squares assumptions are the following.

  1. The joint distributions of $(Y_i,X_i,Z_i)$, $i=1,2,\ldots,n,$ are i.i.d., where $Y_i$ is the $i$th element of $Y$ and where $X_i$ and $Z_i$ are the $i$th row vectors of $X$ and $Z$.

  2. Large outliers are unlikely, i.e., for each $i$, $X_i$, $Z_i$ and $U_i$ have finite fourth moments, where $U_i$ is the $i$th element of $U$.

  3. $(X,Z)$ has full column rank (i.e., there is no perfect multicollinearity; this ensures the invertibility of $W^TW$).

  4. (Extended least squares assumptions: while I do not think this is necessary, and it has been said to me that it is not, we may also assume homoskedasticity, i.e., $\text{Var}(U_i|X_i,Z_i)=\sigma^2_U$ for each $i$, and that the conditional distribution of $U_i$ given $(X_i,Z_i)$ is normal for each $i$, i.e., normal errors.)

NOTE ON TERMINOLOGY

In $(1)$, the Conditional Mean Zero assumption is the assumption that $E(U|X,Z)=0$. The Conditional Mean Independence assumption, however, is the assumption that $E(U|X,Z)=E(U|Z)$.

This terminology is used in e.g. Introduction to Econometrics, 3rd ed. by James H. Stock and Mark W. Watson, p. 281; and Econometric Analysis of Cross Section and Panel Data, 1st ed. by Jeffrey M. Wooldridge, p. 607. See also Conditional Independence Restrictions: Testing and Estimation for similar discussions.

ADDITIONAL THOUGHTS AND SUBQUERY 2

Contrary to James H. Stock and Mark W. Watson, I think that conditional mean independence does not ensure an unbiased OLS estimate of $\beta$. This is because $E(U|Z)$ may take nonlinear forms such as $E(U|Z)=p(Z)$, where $p(Z)$ is a polynomial in $Z$, or $E(U|Z)=\exp(Z\gamma)$, where $\gamma$ is some parameter yet to be estimated (here I am using the matrix exponential); then, I think, nonlinear regression has to be applied, which generally leaves us with biased estimates. Also, the OLS estimate of $\beta$ in $(1)$ may not even coincide with the OLS estimate of $\beta$ in $(4)$ if $E(U|Z)$ takes certain nonlinear forms. (Psychologically, I also feel that the statement made in the book by Stock & Watson is too good to be true.)

Thus, an additional question is whether there is some counterexample to the proposition that conditional mean independence leads to an unbiased OLS estimate.

SUBQUERY 3

In Mostly Harmless Econometrics, Angrist & Pischke argue in subsection 3.3 (pp. 68–91) that under conditional independence (CI), i.e., $Y$ being independent of $X$ given $W$ (which, I guess, is a stronger condition than the conditional mean independence assumption given above), there is a tight connection between matching estimates of the effect of $X$ on $Y$ and the coefficients on $X$ in the regression of $Y$ on $X$ and $W$. This motivates the claim that under CI the OLS estimate of the coefficient on $X$ in $(1)$ is less biased than if CI does not hold (all else equal).

Now, can this idea somehow be used to answer my main question here?

Best Answer

It's false. As you observe, if you read Stock and Watson closely, they don't actually endorse the claim that OLS is unbiased for $\beta$ under conditional mean independence. They endorse the much weaker claim that OLS is unbiased for $\beta$ if $E(u|x,z)=z\gamma$. Then, they say something vague about non-linear least squares.

Your equation (4) contains what you need to see that the claim is false. Estimating equation (4) by OLS while omitting the variable $E(u|z)$ leads to omitted variables bias. As you probably recall, the bias term from an omitted variable (when the omitted variable has a coefficient of 1) is controlled by the coefficients from the following auxiliary regression: \begin{align} E(u|z) = x\alpha_1 + z\alpha_2 + \nu \end{align} The bias in the original regression for $\beta$ is $\alpha_1$ from this regression, and the bias on $\gamma$ is $\alpha_2$. If $x$ is correlated with $E(u|z)$, after controlling linearly for $z$, then $\alpha_1$ will be non-zero and the OLS coefficient will be biased.
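To spell out where that bias formula comes from, in the notation of your own proof: write $w=(x,z)$ and estimate $(4)$ by OLS while leaving out $E(u|z)$. Then \begin{align}(\hat\beta^T,\hat\gamma^T)^T &=(w^Tw)^{-1}w^Ty\\ &=(\beta^T,\gamma^T)^T+(w^Tw)^{-1}w^TE(u|z)+(w^Tw)^{-1}w^Tv,\end{align} and since $E(v|w)=0$, $$E\big((\hat\beta^T,\hat\gamma^T)^T\,\big|\,w\big)=(\beta^T,\gamma^T)^T+(w^Tw)^{-1}w^TE(u|z).$$ But $(w^Tw)^{-1}w^TE(u|z)$ is exactly the coefficient vector $(\alpha_1^T,\alpha_2^T)^T$ (in this sample) from the auxiliary regression of $E(u|z)$ on $x$ and $z$, so the bias on $\hat\beta$ is $\alpha_1$, and it vanishes only when $x$ is uncorrelated with $E(u|z)$ after controlling linearly for $z$.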

Here is an example to prove the point: \begin{align} \xi &\sim F(), \; \zeta \sim G(), \; \nu \sim H()\quad \text{all independent}\\ z &=\xi\\ x &= z^2 + \zeta\\ u &= z+z^2-E(z+z^2)+\nu \end{align}

Looking at the formula for $u$, it is clear that $E(u|x,z)=E(u|z)=z+z^2-E(z+z^2)$. Looking at the auxiliary regression, it is clear that (absent some fortuitous choice of $F,G,H$) $\alpha_1$ will not be zero.

Here is a very simple example in R which demonstrates the point:

set.seed(12344321)
z <- runif(n=100000,min=0,max=10)
x <- z^2 + runif(n=100000,min=0,max=20)                    # x depends on z only through z^2
u <- z + z^2 - mean(z+z^2) + rnorm(n=100000,mean=0,sd=20)  # E(u|x,z) = E(u|z) = z + z^2 - E(z+z^2)
y <- x + z + u                                             # true coefficient on x is 1

summary(lm(y~x+z))       # coefficient on x is biased upward (by about 0.63)

# auxiliary regression: regress the omitted variable E(u|z) (up to a constant) on x and z
summary(lm(z+z^2~x+z))   # coefficient on x estimates the bias on the coefficient on x above

Notice that the first regression gives you a coefficient on $x$ which is biased up by 0.63, reflecting the fact that $x$ "has some $z^2$ in it" as does $E(u|z)$. Notice also that the auxiliary regression gives you a bias estimate of about 0.63.

So, what are Stock and Watson (and your lecturer) talking about? Let's go back to your equation (4): \begin{align} y = x\beta + z\gamma + E(u|z) + v \end{align}

It's an important fact that the omitted variable is only a function of $z$. It seems like if we could control for $z$ really well, that would be enough to purge the bias from the regression, even though $x$ may be correlated with $u$.

Suppose we estimated the equation below using either a non-parametric method to estimate the function $f()$ or using the correct functional form $f(z)=z\gamma+E(u|z)$. If we were using the correct functional form, we would be estimating it by non-linear least squares (explaining the cryptic comment about NLS): \begin{align} y = x\beta + f(z) + v \end{align} That would give us a consistent estimator for $\beta$ because there is no longer an omitted variable problem.
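For instance, continuing the R simulation above (just a sketch): in that DGP $f(z)=2z+z^2-E(z+z^2)$ is quadratic in $z$, so adding a squared term in $z$ already controls for $z$ well enough, and a flexible spline in $z$ does roughly the same job without using the exact functional form:

# control for z flexibly; in the simulated DGP, f(z) is quadratic in z
summary(lm(y ~ x + z + I(z^2)))        # coefficient on x is now close to the true value of 1

# or, without assuming the functional form, use a spline in z
library(splines)
summary(lm(y ~ x + ns(z, df = 10)))    # coefficient on x again close to 1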

Alternatively, if we had enough data, we could go "all the way" in controlling for $z$. We could look at a subset of the data where $z=1$, and just run the regression: \begin{align} y = x\beta + v \end{align} This would give unbiased, consistent estimators of the elements of $\beta$, except for the intercept, of course, which would be polluted by $f(1)$. Obviously, you could also get a (different) consistent, unbiased estimator by running that regression only on data points for which $z=2$. And another one for the points where $z=3$. Etc. Then you'd have a bunch of good estimators, from which you could make a great estimator by, say, averaging them all together somehow.

This latter thought is the inspiration for matching estimators. Since we don't usually have enough data to literally run the regression only for $z=1$, or even for pairs of points where $z$ is identical, we instead run the regression for points where $z$ is "close enough" to being identical.
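As a crude illustration of that idea, continuing the simulation above (only a sketch, not a serious matching estimator): cut $z$ into narrow bins, regress $y$ on $x$ within each bin, and average the within-bin slopes:

# crude subclassification on z: within-bin regressions of y on x, averaged over bins
bins   <- cut(z, breaks = 100)
slopes <- sapply(split(data.frame(y, x), bins),
                 function(d) coef(lm(y ~ x, data = d))["x"])
mean(slopes)             # close to the true value of 1, unlike the biased full-sample estimate above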