Solved – Omitted variable bias in a regression containing only dummy variables

bias, multiple regression, regression coefficients

What is the right way to think of omitted variable bias in a regression that only has dummy variables?

Let's say I have the following equation:

(1) $y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_3+\epsilon$,

where y is the price of shoes, and x1, x2 and x3 are dummy variables for three different regions (the fourth region, the reference category, is omitted).

And I suspect that I am missing a variable (x4) to adjust for 'level of urbanization of each region' — each region contains an uneven number of areas with different levels of urbanization. Thus, the true model is, or so I suspect:

(2) $y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_3+\delta x_4+\epsilon$

Now I know I can sign the bias of any one of the coefficients in equation (1) if two conditions are met: a) x4 is correlated with x1, x2 or x3; and b) x4 has an effect on y (i.e. δ≠0).

However, I am not sure if I can explore condition a) above since the correlation of x4, in this case, would be with a categorical variable, not a continuous one.

How can I go about this?

Best Answer

You're right that the requirement is $\mathrm{cov}\left(x_4,x_1\right)\neq0$. The important point is that $\mathrm{cov}$ doesn't care whether either variable (or both) is continuous or categorical. You can calculate it regardless of their nature!
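To see this condition at work in a dummy-only regression, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the four regions, the region-level urbanization means in `urban_mean`, and the true coefficients `beta_true` and `delta_true` are invented, not taken from the question. It fits the short model (1) and the long model (2) by least squares and compares the gap between them with the standard omitted-variable-bias decomposition, in which the bias on each dummy equals $\hat\delta$ times that dummy's coefficient in an auxiliary regression of $x_4$ on the dummies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Hypothetical data: four regions (0 is the reference); urbanization differs by region.
region = rng.integers(0, 4, size=n)
urban_mean = np.array([0.2, 0.5, 0.3, 0.8])   # assumed region-level means of x4
x4 = urban_mean[region] + rng.normal(0, 0.1, n)

# Dummies x1..x3 for regions 1..3; region 0 is absorbed by the intercept.
D = np.column_stack([(region == j).astype(float) for j in (1, 2, 3)])
X_short = np.column_stack([np.ones(n), D])    # model (1)
X_long = np.column_stack([X_short, x4])       # model (2)

beta_true = np.array([10.0, 1.0, 2.0, 3.0])   # assumed true beta_0..beta_3
delta_true = 5.0                              # assumed true effect of x4
y = X_short @ beta_true + delta_true * x4 + rng.normal(0, 1, n)

b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)
b_long, *_ = np.linalg.lstsq(X_long, y, rcond=None)

# Auxiliary regression of the omitted variable on the included dummies.
gamma, *_ = np.linalg.lstsq(X_short, x4, rcond=None)

delta_hat = b_long[-1]
bias_hat = b_short - b_long[:-1]              # how far model (1) is from model (2)

print("short-minus-long coefficients:", bias_hat)
print("delta_hat * gamma:            ", delta_hat * gamma)
```

With mutually exclusive dummies, each slope in `gamma` is simply the mean of $x_4$ in that region minus its mean in the reference region, so the sign of the bias on $\beta_j$ is the sign of $\delta$ times that difference in means.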

Concretely in your case: if $x_1$ is binary, and $x_4$ is continuous then

$\mathrm{cov}\left(x_4,x_1\right)=\frac{1}{n}\sum_{i=1}^n \left(x_{1i}-\overline{x_1}\right)\left(x_{4i}-\overline{x_4}\right),$

i.e. the definition is exactly the same. Note that every term in this formula can be computed; it doesn't matter that $x_1$ is binary!
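As a quick check, the definition can be evaluated directly on a binary and a continuous variable. This is only a sketch on made-up data: the way `x4` is generated, with a shift when `x1` equals 1, is an assumption for illustration. `np.cov` with `bias=True` uses the same $1/n$ normalization as the formula above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

x1 = rng.integers(0, 2, size=n).astype(float)   # binary dummy
x4 = 0.3 + 0.4 * x1 + rng.normal(0, 0.1, n)     # continuous; shifted when x1 = 1 (assumed)

# Covariance straight from the definition, with 1/n normalization.
cov_def = np.mean((x1 - x1.mean()) * (x4 - x4.mean()))

# The same quantity from NumPy; bias=True selects the 1/n normalization.
cov_np = np.cov(x1, x4, bias=True)[0, 1]

print(cov_def, cov_np)
```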

Yes, because $x_1$ only takes the values 0 and 1, the formula can be split by group:

$\mathrm{cov}\left(x_4,x_1\right)=\frac{1}{n}\sum_{i=1}^n \left(x_{1i}-\overline{x_1}\right)\left(x_{4i}-\overline{x_4}\right)=\frac{-\overline{x_1}\sum_{x_{1i}=0}\left(x_{4i}-\overline{x_4}\right) + \left(1-\overline{x_1}\right)\sum_{x_{1i}=1}\left(x_{4i}-\overline{x_4}\right)}{n},$

but that is just a technical simplification.
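The split by group can be verified numerically. The sketch below again uses invented data (the distribution of `x4` is an assumption). It also checks one further simplification that is not written out above but follows from the same algebra: with $p=\overline{x_1}$, the covariance equals $p(1-p)$ times the difference between the mean of $x_4$ where $x_1=1$ and its mean where $x_1=0$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

x1 = rng.integers(0, 2, size=n).astype(float)   # binary dummy
x4 = rng.normal(0.5, 0.2, n) + 0.3 * x1         # continuous; assumed to depend on x1

p = x1.mean()                # share of observations with x1 = 1
dev4 = x4 - x4.mean()        # deviations of x4 from its overall mean

# Definition of the covariance (1/n normalization).
cov_def = np.mean((x1 - p) * dev4)

# Split form: observations with x1 = 0 contribute -p, those with x1 = 1 contribute (1 - p).
cov_split = (-p * dev4[x1 == 0].sum() + (1 - p) * dev4[x1 == 1].sum()) / n

# Further simplification: p * (1 - p) * (difference in group means of x4).
cov_means = p * (1 - p) * (x4[x1 == 1].mean() - x4[x1 == 0].mean())

print(cov_def, cov_split, cov_means)
```

The last form makes the check practical: the covariance is nonzero exactly when the average level of urbanization differs between observations inside and outside the region (and the region is neither empty nor the whole sample).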
