Solved – Omitted Variable Bias & Multicollinearity: Why are the coefficient SEs smaller in the unbiased specification

Tags: bias, multicollinearity, r

In Introductory Econometrics: A Modern Approach, Wooldridge writes the following about omitted variable bias and its effect on the variance of the OLS estimator (x1 and x2 are correlated):

[Image: Wooldridge's comparison of $\mathrm{Var}(\hat\beta_1)$ across the two specifications]

This intuitively makes sense: by definition of omitted variable bias, x1 and x2 are correlated, so including x2 in the regression should inflate the variance of the estimate through multicollinearity.
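For reference (the screenshot is not reproduced here), the standard textbook comparison being invoked is, paraphrasing Wooldridge's notation from memory:

$$\operatorname{Var}(\hat\beta_1) = \frac{\sigma^2}{\mathrm{SST}_1\,(1 - R_1^2)} \quad (x_2 \text{ included}), \qquad \operatorname{Var}(\tilde\beta_1) = \frac{\sigma^2}{\mathrm{SST}_1} \quad (x_2 \text{ omitted}),$$

where $\mathrm{SST}_1$ is the total sample variation in $x_1$ and $R_1^2$ is the R-squared from regressing $x_1$ on $x_2$. Conditional on the regressors and holding $\sigma^2$ fixed, $\operatorname{Var}(\tilde\beta_1) \le \operatorname{Var}(\hat\beta_1)$, which is the claim the simulation appears to contradict.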

However, when I run a simulation in R, I consistently get the exact opposite of what Wooldridge suggests. Consider the data generating process:

x1 <- rnorm(10000)
x2 <- rnorm(10000) + 0.2*x1
y <- 0.5 -2*x1 -2.5*x2 + rnorm(10000)
summary(lm(y ~ x1 + x2))
summary(lm(y ~ x1))

No matter how many times I run this simulation, the standard error of beta1 in the omitted-variable case is always larger than in the unbiased specification.

How is this possible?

Best Answer

First, to generate RVs with a specified correlation, use

> rho=.2    
> x2<-rho*x1+sqrt(1-rho^2)*rnorm(10000)

Your RVs did not have the correlation you expected.
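A quick check of what the original construction actually produces (my illustration, using the questioner's coefficients):

```r
# Questioner's DGP: x2 = z + 0.2*x1 with z, x1 ~ N(0,1) independent
# => cov(x1, x2) = 0.2, var(x2) = 1 + 0.2^2
# => cor(x1, x2) = 0.2 / sqrt(1 + 0.2^2), not 0.2
implied_cor <- 0.2 / sqrt(1 + 0.2^2)
implied_cor  # ~0.196

# The rho*x1 + sqrt(1 - rho^2)*z construction instead gives
# cor(x1, x2) = rho exactly and var(x2) = 1 exactly
```

The discrepancy is small at rho = 0.2, but the second construction lets you dial the correlation in exactly.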

Second, by your selection of $\rho=0.2$, you've made $x_1$ and $x_2$ only weakly correlated. Omitting $x_2$ forces the linear model to "stretch" $x_1$ (high variance in $\hat\beta_1$) to try to cover most of what $x_2$ was covering, so you are seeing the correct behavior: $x_1$ and $x_2$ are not meaningfully multicollinear.
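The inflation from including $x_2$ is governed by the variance inflation factor, $1/(1-\rho^2)$ for a two-regressor model. A quick check at the two correlations used below (my illustration, not from the original answer):

```r
# Variance inflation factor for a two-regressor model with correlation rho
vif <- function(rho) 1 / (1 - rho^2)

vif(0.2)  # ~1.04: Var(beta1_hat) barely inflated, so omitting x2 hurts more
vif(0.9)  # ~5.26: strong inflation, so including x2 hurts more
```

At $\rho=0.2$ the penalty for including $x_2$ is about 4%, far smaller than the residual-variance penalty for omitting it.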

If you set the correlation to 0.9, you will see what you are expecting. Here are my results for the two cases:

> x1 <- rnorm(10000)
> rho=.2 #then rho=0.9    
> x2=rho*x1+sqrt(1-rho^2)*rnorm(10000)    
> y <- 0.5 -2*x1 -2.5*x2 + rnorm(10000)    
> print(summary(lm(y ~ x1 + x2)))    
> print(summary(lm(y ~ x1)))

First $\rho=.2$ with experimental cor=0.1943525

Biased result: $se(\beta_1)=0.02691$

Unbiased result: $se(\beta_1)=0.01026$

as we would expect for NON collinear $x_1,x_2$

Now for $\rho=0.9$ with experimental cor=0.8967111

Biased result: $se(\beta_1)=0.01497$

Unbiased result: $se(\beta_1)=0.02289$

as you properly understand and expect for truly multicollinear data.

So this gives us a practical definition of multicollinearity: $\rho \approx 0.825$ is the point at which increasing the correlation causes a crossover from a larger se in the biased model to a larger se in the unbiased model.
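For this particular DGP the crossover can also be found in closed form (my derivation, not from the simulation above, using the asymptotic approximations $se(\tilde\beta_1)^2 \approx (\beta_2^2(1-\rho^2)+\sigma^2)/n$ for the omitted model and $se(\hat\beta_1)^2 \approx \sigma^2/(n(1-\rho^2))$ for the full model, with unit-variance regressors):

```r
# Setting the two variances equal with sigma = 1 and u = 1 - rho^2
# gives beta2^2 * u^2 + u - 1 = 0; take the positive root
beta2 <- -2.5
b2 <- beta2^2
u <- (-1 + sqrt(1 + 4 * b2)) / (2 * b2)
rho_star <- sqrt(1 - u)
rho_star  # ~0.82, close to the ~0.825 found by simulation
```

The crossover depends on $\beta_2$ and $\sigma$, so it is a property of this DGP rather than a universal threshold.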

This is consistent with what I've read in econometrics and other social science texts, but I had not tested it myself until now.