The main issue here is the nature of the omitted variable bias. Wikipedia states:
Two conditions must hold true for omitted-variable bias to exist in
linear regression:
- the omitted variable must be a determinant of the dependent variable (i.e., its true regression coefficient is not zero); and
- the omitted variable must be correlated with one or more of the included independent variables (i.e., $\operatorname{cov}(z,x) \neq 0$).
It's important to note the second criterion carefully. Your betas will be biased only under certain circumstances. Specifically, if two variables both contribute to the response and are correlated with each other, but you include only one of them, then (in essence) the effects of both will be attributed to the included variable, biasing the estimate of that parameter. So perhaps only some of your betas are biased, not necessarily all of them.
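To see why both criteria are needed, it helps to write down the textbook result for the simplest case (notation mine: one included regressor $x$, one omitted regressor $z$):
$$ \mathbb{E}\big[\hat{\beta}_x\big] = \beta_x + \beta_z \delta, \quad \text{where } \delta = \frac{\operatorname{cov}(x, z)}{\operatorname{var}(x)} $$
Here $\delta$ is just the slope from the auxiliary regression of $z$ on $x$, so the bias term $\beta_z \delta$ vanishes if either $\beta_z = 0$ (the first criterion fails) or $\operatorname{cov}(x, z) = 0$ (the second fails).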
Another disturbing possibility is that if your sample is not representative of the population (which it rarely truly is), and you omit a relevant variable, then even if that variable is uncorrelated with the others, it can cause a vertical shift that biases your estimate of the intercept. For example, imagine a variable, $Z$, increases the level of the response, and that your sample is drawn from the upper half of the $Z$ distribution, but $Z$ is not included in your model. Then your estimate of the population mean response (and the intercept) will be biased high, despite the fact that $Z$ is uncorrelated with the other variables. In addition, there may be an interaction between $Z$ and variables in your model; this, too, can cause bias without $Z$ being correlated with your variables (I discuss this idea in my answer here).
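Here is a minimal R sketch of that intercept-shift idea (a toy of my own, separate from the simulation below): $Z$ raises the response, the sample covers only the upper half of the $Z$ distribution, and $Z$ is independent of $x$, yet the fitted intercept lands well above its true value while the slope is fine.
set.seed(4444)
N = 100000                    # big sample, so the shift is easy to see
x = rnorm(N)                  # focal predictor, independent of Z
z = abs(rnorm(N))             # Z observed only from the upper half of its
                              #  distribution, & omitted from the model
y = 0 + 1*x + 1*z + rnorm(N)  # true intercept is 0; both true slopes are 1
coef(lm(y~x))                 # intercept ~ E[Z | Z > 0] = sqrt(2/pi) ~ .80,
                              #  not 0; the slope on x is still ~ 1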
Now, given that at equilibrium everything in the world is ultimately correlated with everything else, we might find all of this very troubling. Indeed, when doing observational research, it is best to always assume that every variable is endogenous.
There are, however, limits to this (cf. Cornfield's Inequality). First, conducting true experiments breaks the correlation between the focal variable (the treatment) and any otherwise relevant, but unobserved, explanatory variables. Second, some statistical techniques can be used with observational data to account for such unobserved confounds (prototypically, instrumental-variables regression, but others as well).
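To gesture at the instrumental-variables idea, here is a bare-bones two-stage least squares sketch in R (toy data of my own; the instrument $w$ is assumed to move $x$ while affecting $y$ only through $x$):
set.seed(42)
N    = 100000
w    = rnorm(N)                 # instrument: drives x, unrelated to z
z    = rnorm(N)                 # unobserved confound
x    = .5*w + .5*z + rnorm(N)   # treatment, endogenous because of z
y    = 1*x  + 1*z  + rnorm(N)   # true causal effect of x on y is 1
coef(lm(y~x))[2]                # naive OLS: ~ 1.33, biased upward
xhat = fitted(lm(x~w))          # stage 1: keep only the part of x
                                #  that is driven by the instrument
coef(lm(y~xhat))[2]             # stage 2: ~ 1, bias gone (NB: these
                                #  2nd-stage SEs are wrong; dedicated
                                #  IV routines correct them)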
Setting these possibilities aside (they probably do represent a minority of modeling approaches), what is the long-run prospect for science? This depends on the magnitude of the bias and the volume of exploratory research that gets done. Even if the numbers are somewhat off, they may often be in the neighborhood, and sufficiently close that relationships can be discovered. Then, in the long run, researchers can become clearer about which variables are relevant. Indeed, modelers sometimes explicitly trade increased bias for decreased variance in the sampling distributions of their parameters (cf. my answer here, and the small illustration after the quote below). In the short run, it's worth always remembering the famous quote from Box:
All models are wrong, but some are useful.
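As an aside on that bias-for-variance trade, here is a minimal illustration (my own toy, using a hand-rolled ridge penalty rather than any particular package): shrinking the estimates pulls them away from the true value on average, but makes them bounce around less from sample to sample.
set.seed(99)
bOLS = bRidge = vector(length=5000)
for(i in 1:5000){
  x = rnorm(20)                 # small samples, so estimates are noisy
  y = 1*x + rnorm(20)           # the true slope is 1
  X = cbind(1, x)
  bOLS[i]   = (solve(t(X)%*%X)             %*% t(X)%*%y)[2]  # plain OLS
  bRidge[i] = (solve(t(X)%*%X + 5*diag(2)) %*% t(X)%*%y)[2]  # ridge, lambda = 5
}
mean(bOLS);   var(bOLS)         # centered on 1 (unbiased), larger variance
mean(bRidge); var(bRidge)       # shrunk below 1 (biased), smaller variance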
There is also a potentially deeper philosophical question here: What does it mean to say that the estimate is biased? What is supposed to be the 'correct' answer? If you gather some observational data about the association between two variables (call them $X$ & $Y$), what you are getting is ultimately the marginal correlation between those two variables. That is only the 'wrong' number if you think you are doing something else, namely getting the direct association. Likewise, in a study to develop a predictive model, what you care about is whether, in the future, you will be able to accurately guess the value of an unknown $Y$ from a known $X$. If you can, it doesn't matter whether that's (in part) because $X$ is correlated with $Z$, which is contributing to the resulting value of $Y$. You wanted to be able to predict $Y$, and you can.
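A quick R illustration of that prediction point (again a toy of my own): the fitted coefficient on $x$ recovers the marginal association rather than the direct effect, yet out-of-sample prediction works just as advertised.
set.seed(321)
N     = 100000
z     = rnorm(N)                # contributes to y, but is never recorded
x     = .7*z + rnorm(N)         # x is correlated with z
y     = .5*x + .5*z + rnorm(N)  # the direct effect of x on y is .5
train = 1:(N/2)
test  = (N/2+1):N
mod   = lm(y~x, subset=train)   # fit using x alone
coef(mod)[2]                    # ~ .73: the marginal, not direct, effect
cor(predict(mod, data.frame(x=x[test])), y[test])
                                # ~ .64, & that predictive accuracy is
                                #  real, whatever its source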
The issue you need to worry about is called endogeneity. More specifically, it depends on whether $x_3$ is correlated in the population with $x_1$ or $x_2$. If it is, the associated $b_j$s will be biased. That is because OLS regression methods force the residuals, $u_i$, to be uncorrelated with your covariates, the $x_j$s. However, your residuals are composed of some irreducible randomness, $\varepsilon_i$, plus the contribution of the unobserved (but relevant) variable, $\beta_3 x_{3i}$, which by stipulation is correlated with $x_1$ and/or $x_2$. On the other hand, if both $x_1$ and $x_2$ are uncorrelated with $x_3$ in the population, then their $b$s won't be biased by this (they may well be biased by something else, of course). One way econometricians try to deal with this issue is by using instrumental variables.
For the sake of greater clarity, I've written a quick simulation in R. The first run demonstrates that the sampling distributions of the $b_j$s are unbiased / centered on the true $\beta_j$s when the omitted $x_3$ is uncorrelated with the included variables. In the second run, however, $x_3$ is uncorrelated with $x_1$ but correlated with $x_2$. Not coincidentally, $b_1$ is unbiased, but $b_2$ is biased.
library(MASS) # you'll need this package below
N = 100 # this is how much data we'll use
beta0 = -71 # these are the true values of the
beta1 = .84 # parameters
beta2 = .64
beta3 = .34
############## uncorrelated version
b0VectU = vector(length=10000) # these will store the parameter
b1VectU = vector(length=10000) # estimates
b2VectU = vector(length=10000)
set.seed(7508) # this makes the simulation reproducible
for(i in 1:10000){              # we'll do this 10k times
  x1 = rnorm(N)
  x2 = rnorm(N)                 # these variables are uncorrelated
  x3 = rnorm(N)
  y  = beta0 + beta1*x1 + beta2*x2 + beta3*x3 + rnorm(N)
  mod = lm(y~x1+x2)             # note all 3 variables are relevant,
                                #  but the model omits x3
  b0VectU[i] = coef(mod)[1]     # here I'm storing the estimates
  b1VectU[i] = coef(mod)[2]
  b2VectU[i] = coef(mod)[3]
}
mean(b0VectU) # [1] -71.00005 # all 3 of these are centered on
mean(b1VectU) # [1] 0.8399306 #  the true values / are unbiased
mean(b2VectU) # [1] 0.6398391 #  e.g., .64 = .64
############## correlated version
r23 = .7 # this will be the correlation in the
b0VectC = vector(length=10000) # population between x2 & x3
b1VectC = vector(length=10000)
b2VectC = vector(length=10000)
set.seed(2734)
for(i in 1:10000){
  x1 = rnorm(N)
  X  = mvrnorm(N, mu=c(0,0), Sigma=rbind(c(  1, r23),
                                         c(r23,   1)))
  x2 = X[,1]
  x3 = X[,2]                  # x3 is correlated w/ x2, but not x1
  y  = beta0 + beta1*x1 + beta2*x2 + beta3*x3 + rnorm(N)
                              # once again, all 3 variables are relevant,
  mod = lm(y~x1+x2)           #  but the model omits x3
  b0VectC[i] = coef(mod)[1]
  b1VectC[i] = coef(mod)[2]   # we store the estimates again
  b2VectC[i] = coef(mod)[3]
}
mean(b0VectC) # [1] -70.99916 # the 1st 2 are unbiased
mean(b1VectC) # [1] 0.8409656 # but the sampling dist of b2 is biased
mean(b2VectC) # [1] 0.8784184 # .88 is not equal to .64
Best Answer
To prove this, start from the probability limit of the OLS estimator. Let $X$ denote the full matrix of regressors actually used, $[1, X_1, X_2]$, let $b = (b_0, b_1, b_2)$ be the parameters we are trying to estimate, and let $\hat{\beta}$ be their OLS estimator. Since the true model is $Y = Xb + b_3 X_3 + u$, we can write $Y = Xb + e$ with $e \equiv u + b_3 X_3$.
\begin{align*} p\lim \hat{\beta} &= p\lim \left[ (X'X)^{-1}X'Y \right] \\ &= p\lim \left[ (X'X)^{-1}X'(Xb + e) \right] \\ &= p\lim \left[ (X'X)^{-1}X'Xb \right] + p\lim \left[ (X'X)^{-1}X'e \right] \\ &= p\lim \left[ (X'X)^{-1}X'X \right] b + p\lim \left[ (X'X)^{-1}X'(b_3 X_3 + u) \right] \\ &= b + b_3 p\lim \left[ (X'X)^{-1}X' X_3 \right] + p\lim \left[ (X'X)^{-1}X'u \right] \\ &= b + b_3 p\lim \left[ (X'X)^{-1}X' X_3 \right] \\ &= b + b_3 \left[ \mathbb{E}(X'X) \right]^{-1} \mathbb{E}(X' X_3) \end{align*}
Above, the key step is of course that $p\lim \left[ (X'X)^{-1}X'u \right] = 0$, which holds because
$$ p\lim \left[ (X'X)^{-1}X'u \right] = \left( p\lim X'X \right)^{-1} p\lim \left( X'u \right) = \left[ \mathbb{E}(X'X) \right]^{-1} \mathbb{E}(X'u) = 0, $$
since $\mathbb{E}(X'u) = 0$: the original assumption is that each of the regressors is uncorrelated with $u$ (though not necessarily with $e$).
Now we see that $p\lim \hat{\beta} \ne b$ whenever $\mathbb{E}(X'X_3) \ne 0$, that is, whenever there is correlation between $X_1$ and $X_3$ or between $X_2$ and $X_3$.
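As a quick cross-check (my arithmetic, tying this back to the simulation in the other answer): there all three variables have mean 0 and variance 1, $x_1$ is independent of the other two, and $\operatorname{cor}(x_2, x_3) = r_{23} = .7$, so $\mathbb{E}(X'X)/N = I$ and $\mathbb{E}(X'X_3)/N = (0,\ 0,\ .7)'$. The formula then predicts no asymptotic bias for $b_0$ or $b_1$, and
$$ p\lim \hat{\beta}_2 = \beta_2 + \beta_3 r_{23} = .64 + (.34)(.7) = .878, $$
which matches the simulated mean of $0.8784$ almost exactly.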