Regression – Omitted Variable Bias Verification in Gretl

bias, causality, gretl, regression

I am trying to verify the expression for Omitted Variable Bias (OVB) as given e.g. in Wooldridge: $\tilde{\beta_1} = \hat{\beta_1} + \hat{\beta_2} \cdot \tilde{\delta_1}$, where $\tilde{\delta_1}$ is the estimated slope of the regression of $x_2$ on $x_1$.

Choosing the housing price data (hprice1.gdt) from Wooldridge available for gretl, I obtain the following estimates for the relevant regression coefficients:

Model1 (price ~ sqrft)
  ------------------------
  const       11.2041
  sqrft       0.140211

Model2 (price ~ sqrft + bdrms)
  ------------------------
  const       -19.315
  sqrft       0.128436
  bdrms       15.1982

Model3 (bdrms ~ sqrft)
  ------------------------
  const       2.00808
  sqrft       0.000774748

So $\tilde{\beta_1}=0.140211$, $\tilde{\delta_1} = 0.000774748$, $\hat{\beta_2}=15.1982$, and $\hat{\beta_1}=0.128436$.

But these values do not quite match the expression from above:
$0.124836+15.1982*0.000774748 = 0.1366108 \neq 0.140211$

Where is my misunderstanding? Is the formula above valid only in the limit of infinite sample size?

Best Answer

Where is my misunderstanding? Is the formula above valid only in the limit of infinite sample size?

The formula is valid and exact for every sample size, not only asymptotically. There is no misunderstanding on your part, just a simple typo.

When you wrote:

But these values do not quite match the expression from above: $0.124836+15.1982*0.000774748 = 0.1366108 \neq 0.140211$

You entered $0.124836$ instead of $0.128436$, swapping the $8$ and the $4$. Fixing the typo gives the expected result:

$$0.128436+15.1982*0.000774748=0.140211$$
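To make the arithmetic easy to check by hand, the bias term can be computed on its own first. The numbers are the rounded gretl estimates quoted in the question, so the last digit is only approximate:

$$\hat{\beta}_2 \cdot \tilde{\delta}_1 = 15.1982 \times 0.000774748 \approx 0.0117748$$

$$0.128436 + 0.0117748 \approx 0.140211, \qquad 0.124836 + 0.0117748 \approx 0.1366108$$

The first sum reproduces $\tilde{\beta}_1$; the second, with the swapped digits, reproduces the mismatched value.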

The proof is rather simple. Let $Y$ denote price, $X$ denote sqrft and $Z$ denote bdrms. Then:

$$ \tilde{\beta}_1 = \frac{cov(X, Y)}{var(X)}= \frac{cov(X, \hat{\beta}_1X + \hat{\beta}_2Z + \hat{\epsilon})}{var(X)} = \hat{\beta}_1 + \hat{\beta}_2\frac{cov(X, Z)}{var(X)} = \hat{\beta}_1+ \hat{\beta}_2\tilde{\delta}_1 $$

Here $cov(X, \hat{\epsilon}) = 0$ holds by construction in OLS, and $\frac{cov(X, Z)}{var(X)}$ is the slope coefficient from regressing $Z$ on $X$, which we denote by $\tilde{\delta}_1$.
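For readers who want the middle step spelled out, here is the same argument in slower motion. It assumes the long regression includes an intercept, written $\hat{\beta}_0$ here (a symbol not used above), and that $cov$ and $var$ are sample moments:

$$ Y = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 Z + \hat{\epsilon} $$

$$ cov(X, Y) = \underbrace{cov(X, \hat{\beta}_0)}_{=0} + \hat{\beta}_1 var(X) + \hat{\beta}_2 cov(X, Z) + \underbrace{cov(X, \hat{\epsilon})}_{=0} $$

The intercept drops out because the covariance with a constant is zero, and the residual term drops out because OLS residuals are orthogonal to every included regressor. Dividing both sides by $var(X)$ gives the identity.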

This relationship is exact and just a simple property of the algebra of OLS.
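Because the identity is purely algebraic, it should hold on any dataset, not just hprice1. The following is a minimal sketch in plain Python that checks it on synthetic data. The data-generating numbers, the seed, and the helper names (`cov`, `simple_slope`, `two_regressor_slopes`) are all assumptions made for illustration; none of them come from the textbook or from gretl.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance; the n-1 divisor cancels in every ratio below."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def simple_slope(x, y):
    """Slope from regressing y on x with an intercept."""
    return cov(x, y) / cov(x, x)

def two_regressor_slopes(x, z, y):
    """Slopes on x and z from regressing y on x and z with an intercept.

    Solves the 2x2 normal equations for centered data by Cramer's rule.
    """
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    det = sxx * szz - sxz ** 2
    b1 = (sxy * szz - szy * sxz) / det
    b2 = (szy * sxx - sxy * sxz) / det
    return b1, b2

# Synthetic data: hypothetical values chosen for illustration only.
rng = random.Random(0)
n = 100
x = [rng.gauss(2000, 500) for _ in range(n)]
z = [2 + 0.0008 * xi + rng.gauss(0, 0.7) for xi in x]
y = [-20 + 0.13 * xi + 15 * zi + rng.gauss(0, 60) for xi, zi in zip(x, z)]

beta1_tilde = simple_slope(x, y)                      # short regression: y on x
beta1_hat, beta2_hat = two_regressor_slopes(x, z, y)  # long regression: y on x, z
delta1_tilde = simple_slope(x, z)                     # auxiliary regression: z on x

# Difference between the two sides of the OVB identity.
gap = beta1_tilde - (beta1_hat + beta2_hat * delta1_tilde)
print(f"short slope = {beta1_tilde:.6f}")
print(f"long + bias = {beta1_hat + beta2_hat * delta1_tilde:.6f}")
print(f"difference  = {gap:.2e}")
```

The two printed slopes should agree up to floating-point rounding, whatever seed or sample size is used, which is what "exact for all sample sizes" means in practice.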

If you want to manually check this in R, there's a package called wooldridge with all the datasets from the textbook:

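library(wooldridge)
data("hprice1")
# Fit each model once and reuse it
short <- lm(price ~ sqrft, hprice1)          # omits bdrms
long  <- lm(price ~ sqrft + bdrms, hprice1)  # includes bdrms
aux   <- lm(bdrms ~ sqrft, hprice1)          # omitted variable on included one
# Left-hand side: slope from the short regression
coef(short)["sqrft"]
#  sqrft 
# 0.140211 
# Right-hand side: long-regression slope plus the bias term
coef(long)["sqrft"] + coef(long)["bdrms"] * coef(aux)["sqrft"]
#  sqrft 
# 0.140211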