You need to distinguish the causal graph from the regression coefficients here. Something is only 'spurious' if it does not identify the causal effect of interest, and this depends on the graph structure you have assumed, not on any regression coefficients.
As an example (restricting ourselves to causal DAG structures with no hidden variables), assume X causes Y and X causes Z. Then even if Z does not cause Y, you will be able to regress Y on Z and get a non-zero coefficient, so that alone doesn't tell you much. Conditioning on X in a regression of Y on Z is the right thing to do if you want to know the causal effect of Z on Y, assuming that X causes both Y and Z and that Z causes Y rather than vice versa. If, on the other hand, Y causes Z, then despite there being no causal effect to estimate you will again get a non-zero regression coefficient.
It all depends on which variables are connected by causal arrows and which direction those arrows point. It's sometimes useful to simulate data with the relevant structure and run the regressions to get a feel for what can happen.
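To get that feel concretely, here is a minimal simulation sketch in R of the two scenarios above (the variable names and coefficients are illustrative, not from the question):

set.seed(1)
n <- 1e5
# Scenario 1: X causes both Y and Z; Z does not cause Y
x <- rnorm(n)
z <- 2*x + rnorm(n)
y <- 3*x + rnorm(n)
coef(lm(y ~ z))["z"]     # non-zero (about 1.2) despite no causal effect of Z on Y
coef(lm(y ~ z + x))["z"] # about 0 once the confounder X is conditioned on
# Scenario 2: Y causes Z (reverse causation)
y2 <- rnorm(n)
z2 <- 2*y2 + rnorm(n)
coef(lm(y2 ~ z2))["z2"]  # again non-zero (about 0.4) with no causal effect to estimate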
There are some situations where causal structure can be inferred from regressing things on other things and finding zero coefficients, but they are fairly limited. A nice overview can be found in chapter 25 of Shalizi's draft textbook (ch.21-24 are also worth reading). Leaving aside discovery, the basic theoretical framework can be found in compressed form in Pearl's review paper, and as a more leisurely exposition in the references here.
Unfortunately this means that the answer to each of your three questions is "it depends" (on the graph), but the references above should hopefully point you towards what you would have to assume to interpret things the way you're considering.
Where is my misunderstanding? Is the formula above valid only in the limit of infinite sample size?
The formula is valid and exact for all sample sizes. There is no misunderstanding here at all, just a simple typo.
When you wrote:
But these values do not quite match the expression from above:
$$0.124836 + 15.1982 \cdot 0.000774748 = 0.1366108 \neq 0.140211$$
Somehow you put $0.124836$ instead of $0.128436$, swapping the $4$ and the $8$. Fixing the typo gives you the expected result:
$$0.128436 + 15.1982 \cdot 0.000774748 = 0.140211$$
The proof is rather simple. Let $Y$ denote price, $X$ denote sqrft, and $Z$ denote bdrms. Then:
$$
\tilde{\beta}_1 = \frac{cov(X, Y)}{var(X)}= \frac{cov(X, \hat{\beta}_1X + \hat{\beta}_2Z + \hat{\epsilon})}{var(X)} = \hat{\beta}_1 + \hat{\beta}_2\frac{cov(X, Z)}{var(X)} = \hat{\beta}_1+ \hat{\beta}_2\tilde{\delta}_1
$$
where $cov(X, \hat{\epsilon}) = 0$ holds by construction in OLS, and $\frac{cov(X, Z)}{var(X)}$ is the coefficient from regressing $Z \sim X$, which we denote by $\tilde{\delta}_1$.
This relationship is exact and just a simple property of the algebra of OLS.
If you want to check this manually in R, there's a package called wooldridge with all the datasets from the textbook:
library(wooldridge)
data("hprice1")

# Simple regression of price on sqrft alone
coef(lm(price ~ sqrft, hprice1))["sqrft"]
# sqrft
# 0.140211

# Multiple regression, plus the auxiliary regression of bdrms on sqrft
b <- coef(lm(price ~ sqrft + bdrms, hprice1))
d <- coef(lm(bdrms ~ sqrft, hprice1))
b["sqrft"] + b["bdrms"] * d["sqrft"]
# sqrft
# 0.140211
You can test for omitted variable bias without having measurements of the omitted variable if you have an instrumental variable available.
So I'd expand your statement a bit to give:
There are assumptions involved, however, some of them statistically untestable, in saying that a variable is an instrumental variable. So if you don't have measurements of a potential omitted variable, you can't avoid omitted variable bias without making some assumptions.
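For a concrete sketch of what such a test can look like in R, here is a hypothetical simulated example using the AER package (the instrument w and all coefficients are made up for illustration, and w's validity is exactly the kind of partly untestable assumption mentioned above):

library(AER)
set.seed(1)
n <- 1e4
u <- rnorm(n)               # the omitted variable (unobserved in practice)
w <- rnorm(n)               # instrument: drives x but is independent of u
x <- w + u + rnorm(n)       # regressor of interest, contaminated by u
y <- 2*x + 3*u + rnorm(n)   # true causal effect of x on y is 2

coef(lm(y ~ x))["x"]        # OLS: biased upwards (about 3)
fit_iv <- ivreg(y ~ x | w)  # two-stage least squares with w as instrument
coef(fit_iv)["x"]           # IV: close to the true 2
# The Wu-Hausman diagnostic compares OLS and IV; a small p-value is
# evidence of endogeneity, i.e. of omitted variable bias in OLS
summary(fit_iv, diagnostics = TRUE)$diagnostics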