What is the right way to think of omitted variable bias in a regression that only has dummy variables?
Let's say I have the following equation:
(1) $y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_3+\epsilon$,
where $y$ is the price of shoes, and $x_1$, $x_2$ and $x_3$ are dummy variables for three different regions (the dummy for the fourth region, the reference category, is omitted).
I suspect that I am missing a variable, $x_4$, that adjusts for the level of urbanization of each region: each region contains an uneven number of areas with different levels of urbanization. Thus the true model, or so I suspect, is:
(2) $y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_3+\delta x_4+\epsilon$
Now I know I can sign the bias of any one of the coefficients in equation (1) if two conditions are met: a) $x_4$ is correlated with $x_1$, $x_2$ or $x_3$; and b) $x_4$ has an impact on $y$ (i.e. $\delta\neq0$).
However, I am not sure how to check condition a), since here $x_4$ would be correlated with a categorical variable, not a continuous one.
How can I go about this?
Best Answer
You're right that the requirement is $\mathrm{cov}\left(x_4,x_1\right)\neq0$. The important point is that $\mathrm{cov}$ doesn't care whether either variable (or both) is continuous or categorical. You can calculate it regardless of their nature!
Concretely in your case: if $x_1$ is binary and $x_4$ is continuous, then
$\mathrm{cov}\left(x_4,x_1\right)=\frac{1}{n}\sum_{i=1}^n \left(x_{1i}-\overline{x_1}\right)\left(x_{4i}-\overline{x_4}\right),$
i.e. the definition is exactly the same. Note that every term in this formula can be computed; it doesn't matter that $x_1$ is binary!
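To make this concrete, here is a minimal Python sketch that evaluates the covariance formula above for a dummy and a continuous variable. The data are simulated purely for illustration: the names `region`, `x1`, `x4`, the helper `cov_pop`, and the assumed 0.5 shift in urbanization for region 1 are all hypothetical and are not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: four regions labelled 0..3; x1 is the dummy for region 1.
region = rng.integers(0, 4, size=n)
x1 = (region == 1).astype(float)

# Hypothetical urbanization score, assumed to be higher on average in region 1.
x4 = 0.5 * x1 + rng.normal(size=n)

def cov_pop(a, b):
    """Covariance with the 1/n normalisation used in the formula above."""
    return np.mean((a - a.mean()) * (b - b.mean()))

cov_manual = cov_pop(x4, x1)

# np.cov with bias=True uses the same 1/n normalisation.
cov_numpy = np.cov(x4, x1, bias=True)[0, 1]

print(cov_manual, cov_numpy)
```

Nothing in the computation needs to know that `x1` only takes the values 0 and 1; the dummy is treated like any other numeric column.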
Admittedly, because $x_1$ is binary the formula can be rewritten as
$\mathrm{cov}\left(x_4,x_1\right)=\frac{1}{n}\sum_{i=1}^n \left(x_{1i}-\overline{x_1}\right)\left(x_{4i}-\overline{x_4}\right)=\frac{-\overline{x_1}\sum_{x_{1i}=0}\left(x_{4i}-\overline{x_4}\right) + \left(1-\overline{x_1}\right)\sum_{x_{1i}=1}\left(x_{4i}-\overline{x_4}\right)}{n},$
but that is just a technical simplification.
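The split above can be checked numerically. The sketch below, on simulated data (the variable names and the assumed 1.5 shift are hypothetical), compares the direct definition with the split-sum form. It also evaluates one further simplification that is not stated in the answer but follows from the same algebra: the covariance equals $\overline{x_1}\left(1-\overline{x_1}\right)$ times the difference between the mean of $x_4$ in the two groups, so condition a) amounts to asking whether the average of $x_4$ differs between observations inside and outside the region.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical dummy (about 30% ones) and a continuous variable that
# is assumed to be shifted upwards when the dummy equals 1.
x1 = (rng.random(n) < 0.3).astype(float)
x4 = 2.0 + 1.5 * x1 + rng.normal(size=n)

p = x1.mean()          # share of observations with x1 == 1
d4 = x4 - x4.mean()    # deviations of x4 from its overall mean

# Direct definition with the 1/n normalisation.
direct = np.mean((x1 - p) * d4)

# Split into the x1 == 0 and x1 == 1 groups, as in the formula above.
split = (-p * d4[x1 == 0].sum() + (1 - p) * d4[x1 == 1].sum()) / n

# Further simplification: p * (1 - p) * (group mean difference of x4).
mean_diff = p * (1 - p) * (x4[x1 == 1].mean() - x4[x1 == 0].mean())

print(direct, split, mean_diff)
```

All three numbers agree up to floating-point error, which is one way to see that a binary regressor poses no obstacle to checking the correlation condition.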