You need to distinguish the causal graph from the regression coefficients here. Something is only 'spurious' if it does not identify the causal effect of interest, and this depends on the graph structure you have assumed, not on any regression coefficients.
As an example (restricting ourselves to causal DAG structures with no hidden variables), assume X causes Y and X causes Z. Then even if Z does not cause Y, you will be able to regress Y on Z and get a non-zero coefficient, so that alone doesn't tell you much. Conditioning on X in a regression of Y on Z is the right thing to do if you want to know the causal effect of Z on Y, assuming that X causes both Y and Z and that Z causes Y rather than vice versa. If, on the other hand, Y causes Z, then despite there being no causal effect to estimate you will again get a non-zero regression coefficient.
It all depends on which variables are connected by causal arrows and which direction those arrows point. It's sometimes useful to simulate data with the relevant structure and run the regressions to get a feel for what can happen.
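To get that feel concretely, here is a minimal simulation sketch in R of the two scenarios above (the variable names and coefficients are illustrative, not from the question):

set.seed(1)
n <- 1e5
# Scenario 1: X causes both Y and Z; Z does not cause Y
x <- rnorm(n)
z <- 2*x + rnorm(n)
y <- 3*x + rnorm(n)
coef(lm(y ~ z))["z"]     # non-zero (about 1.2) despite no causal effect of Z on Y
coef(lm(y ~ z + x))["z"] # about 0 once the confounder X is conditioned on
# Scenario 2: Y causes Z (reverse causation)
y2 <- rnorm(n)
z2 <- 2*y2 + rnorm(n)
coef(lm(y2 ~ z2))["z2"]  # again non-zero (about 0.4) with no causal effect to estimate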
There are some situations where causal structure can be inferred from regressing things on other things and finding zero coefficients, but they are fairly limited. A nice overview can be found in chapter 25 of Shalizi's draft textbook (ch.21-24 are also worth reading). Leaving aside discovery, the basic theoretical framework can be found in compressed form in Pearl's review paper, and as a more leisurely exposition in the references here.
Unfortunately this means that the answer to each of your three questions is "it depends" (on the graph), but the references above should hopefully point you towards what you would have to assume to interpret things the way you're considering.
Where is my misunderstanding? Is the formula above valid only in the limit of infinite sample size?
The formula is valid and exact for all sample sizes. There is no misunderstanding here at all, just a simple typo.
When you wrote:
But these values do not quite match the expression from above:
$$0.124836 + 15.1982 \cdot 0.000774748 = 0.1366108 \neq 0.140211$$
Somehow you put $0.124836$ instead of $0.128436$, swapping the $4$ and the $8$. Fixing the typo gives you the expected result:
$$0.128436 + 15.1982 \cdot 0.000774748 = 0.140211$$
The proof is rather simple. Let $Y$ denote price, $X$ denote sqrft, and $Z$ denote bdrms. Then:
$$
\tilde{\beta}_1 = \frac{cov(X, Y)}{var(X)}= \frac{cov(X, \hat{\beta}_1X + \hat{\beta}_2Z + \hat{\epsilon})}{var(X)} = \hat{\beta}_1 + \hat{\beta}_2\frac{cov(X, Z)}{var(X)} = \hat{\beta}_1+ \hat{\beta}_2\tilde{\delta}_1
$$
where $cov(X, \hat{\epsilon}) = 0$ holds by construction in OLS, and $\frac{cov(X, Z)}{var(X)}$ is the coefficient from regressing $Z \sim X$, which we denote by $\tilde{\delta}_1$.
This relationship is exact and just a simple property of the algebra of OLS.
If you want to check this manually in R, there's a package called wooldridge with all the datasets from the textbook:
library(wooldridge)
data("hprice1")

# Simple regression of price on sqrft alone
coef(lm(price ~ sqrft, hprice1))["sqrft"]
# sqrft
# 0.140211

# Multiple regression, plus the auxiliary regression of bdrms on sqrft
b <- coef(lm(price ~ sqrft + bdrms, hprice1))
d <- coef(lm(bdrms ~ sqrft, hprice1))
b["sqrft"] + b["bdrms"] * d["sqrft"]
# sqrft
# 0.140211
You can test for omitted variable bias without having measurements of the omitted variable if you have an instrumental variable available.
So I'd expand your statement a bit to give:
There are assumptions involved, however, some of them statistically untestable, in saying that a variable is an instrumental variable. So if you don't have measurements of a potential omitted variable, you can't avoid omitted variable bias without making some assumptions.
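For a concrete sketch of what such a test can look like in R, here is a hypothetical simulated example using the AER package (the instrument w and all coefficients are made up for illustration, and w's validity is exactly the kind of partly untestable assumption mentioned above):

library(AER)
set.seed(1)
n <- 1e4
u <- rnorm(n)               # the omitted variable (unobserved in practice)
w <- rnorm(n)               # instrument: drives x but is independent of u
x <- w + u + rnorm(n)       # regressor of interest, contaminated by u
y <- 2*x + 3*u + rnorm(n)   # true causal effect of x on y is 2

coef(lm(y ~ x))["x"]        # OLS: biased upwards (about 3)
fit_iv <- ivreg(y ~ x | w)  # two-stage least squares with w as instrument
coef(fit_iv)["x"]           # IV: close to the true 2
# The Wu-Hausman diagnostic compares OLS and IV; a small p-value is
# evidence of endogeneity, i.e. of omitted variable bias in OLS
summary(fit_iv, diagnostics = TRUE)$diagnostics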