I have a theoretical economic model, which is as follows:
$$ y = a + b_1x_1 + b_2x_2 + b_3x_3 + u $$
So theory says that three factors, $x_1$, $x_2$ and $x_3$, determine $y$.
Now I have real data and need to estimate $b_1$, $b_2$ and $b_3$. The problem is that the data set contains only $x_1$ and $x_2$; there are no data for $x_3$. So the model I can actually fit is:
$$y = a + b_1x_1 + b_2x_2 + u$$
- Is it OK to estimate this model?
- Do I lose anything estimating it?
- If I do estimate $b_1$, $b_2$, then where does the $b_3x_3$ term go?
- Is it accounted for by the error term $u$?
We would also like to assume that $x_3$ is not correlated with $x_1$ or $x_2$.
Best Answer
The issue you need to worry about is called endogeneity; this particular form of it is usually called omitted-variable bias. Whether it matters depends on whether $x_3$ is correlated in the population with $x_1$ or $x_2$. If it is, then the associated $b_j$s will be biased. That is because OLS forces the residuals, $u_i$, to be uncorrelated with your covariates, the $x_j$s. However, the error term of the model you can actually fit is composed of some irreducible randomness, $\varepsilon_i$, plus the unobserved (but relevant) term $b_3x_3$, which by stipulation is correlated with $x_1$ and/or $x_2$. The part of $x_3$ that moves together with those covariates therefore gets credited to their coefficients. On the other hand, if both $x_1$ and $x_2$ are uncorrelated with $x_3$ in the population, then their $b$s won't be biased by this (they may well be biased by something else, of course); the $b_3x_3$ term is simply absorbed into the error term, which costs you precision but not unbiasedness. One way econometricians try to deal with this issue is by using instrumental variables.
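To see where the $b_3x_3$ term goes, it may help to write out the standard omitted-variable algebra. This is only a sketch: the auxiliary regression of $x_3$ on the included covariates, with coefficients $\delta_0$, $\delta_1$, $\delta_2$ and error $v$, is notation introduced here for illustration and does not appear in the question. By construction, $v$ is uncorrelated with $x_1$ and $x_2$.

$$ x_3 = \delta_0 + \delta_1x_1 + \delta_2x_2 + v $$

Substituting this into the full model, with $\varepsilon$ denoting the irreducible error, gives

$$ y = (a + b_3\delta_0) + (b_1 + b_3\delta_1)x_1 + (b_2 + b_3\delta_2)x_2 + (\varepsilon + b_3v) $$

So the regression on $x_1$ and $x_2$ alone recovers $b_j + b_3\delta_j$ rather than $b_j$. If $x_3$ is uncorrelated with both $x_1$ and $x_2$, then $\delta_1 = \delta_2 = 0$, the slopes are unaffected, and only the intercept and the error term change.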
For the sake of greater clarity, I've written a quick simulation in R demonstrating that the sampling distribution of $b_2$ is unbiased (that is, centered on the true value, $\beta_2$) when $x_2$ is uncorrelated with $x_3$. In the second run, however, note that $x_3$ is uncorrelated with $x_1$, but not with $x_2$. Not coincidentally, $b_1$ is unbiased, but $b_2$ is biased.