Solved – Dummy variable trap issues


I am running a large OLS regression where all the independent variables (around 400) are dummy variables. If all are included, there is perfect multicollinearity (the dummy variable trap), so I have to omit one of the variables before running the regression.
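As a minimal sketch of the trap itself, assuming a single three-level categorical predictor (the male/female/unknown example from the question, with simulated labels): once an intercept is included, the dummy columns sum to the intercept column, so the full design matrix is rank deficient, and dropping any one level restores full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
levels = ["male", "female", "unknown"]
labels = rng.choice(levels, size=100, p=[0.45, 0.45, 0.10])

# One 0/1 column per level, plus an intercept column of ones.
dummies = np.column_stack([(labels == lev).astype(float) for lev in levels])
X_full = np.column_stack([np.ones(len(labels)), dummies])

# The three dummy columns sum to the intercept column, so the 4-column
# matrix has rank 3.  Dropping one level ("unknown" here) gives full rank.
print(np.linalg.matrix_rank(X_full))          # 3, not 4: perfectly collinear
print(np.linalg.matrix_rank(X_full[:, :-1]))  # 3 columns, rank 3: full rank
```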

My first question is, which variable should be omitted? I have read that it is better to omit a variable that is present in many of the observations rather than one that is present in only a few (e.g. if almost all observations are "male" or "female" and just a few are "unknown", omit either "male" or "female"). Is this justified?

After running the regression with a variable omitted, I can estimate the coefficient of the omitted variable because I impose the constraint that the coefficients of all the dummy variables should average to 0. I use this fact to shift the coefficient values of all the included variables and obtain an estimate for the omitted one. My next question is whether there is a similar technique for estimating the standard error of the omitted variable's coefficient. As it stands, I have to re-run the regression omitting a different variable (and including the variable I had omitted in the first regression) in order to obtain a standard error estimate for the coefficient of the originally omitted variable.
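A hedged sketch of that re-centering step (the `statsmodels` library and the toy male/female/unknown setup are illustrative assumptions, not from the question): under the sum-to-zero convention with $J$ levels, the omitted effect equals $-\tfrac{1}{J}\sum_{j \neq k} \hat\beta_j$, a linear combination of the fitted coefficients, so one standard way to get its standard error without re-running the regression is a linear-restriction test against the coefficient covariance matrix.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
levels = ["male", "female", "unknown"]
labels = rng.choice(levels, size=200, p=[0.45, 0.45, 0.10])
group_means = {"male": 1.0, "female": 2.0, "unknown": 3.0}  # hypothetical truth
y = np.array([group_means[l] for l in labels]) + rng.normal(size=200)

# Design: intercept plus dummies for "male" and "female"; "unknown" omitted.
X = np.column_stack([np.ones(200),
                     (labels == "male").astype(float),
                     (labels == "female").astype(float)])
res = sm.OLS(y, X).fit()
b0, b_male, b_female = res.params
J = len(levels)

# Re-center so the three level effects average to zero.
shift = (b_male + b_female) / J
effects = {"male": b_male - shift,
           "female": b_female - shift,
           "unknown": -shift}  # recovered estimate for the omitted level
print(effects)

# The omitted effect is -(b_male + b_female)/J, i.e. the linear combination
# w'beta with w = [0, -1/J, -1/J]; t_test reports it with a standard error.
w = np.array([0.0, -1.0 / J, -1.0 / J])
print(res.t_test(w))
```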

Finally, I notice that the coefficient estimates I get (after re-centering around zero) vary slightly depending on which variable is omitted. In theory, would it be better to run several regressions, each omitting a different variable, and then average the coefficient estimates from all the regressions?

Best Answer

You should get the "same" estimates no matter which variable you omit; the coefficients may be different, but the estimates of particular quantities or expectations should be the same across all the models.

In a simple case, let $x_i=1$ for men and 0 for women. Then, we have the model: $$\begin{align*} E[y_i \mid x_i] &= x_iE[y_i \mid x_i = 1] + (1 - x_i)E[y_i \mid x_i = 0] \\ &= E[y_i \mid x_i=0] + \left[E[y_i \mid x_i= 1] - E[y_i \mid x_i=0]\right]x_i \\ &= \beta_0 + \beta_1 x_i. \end{align*}$$ Now, let $z_i=1$ for women and 0 for men. Then $$\begin{align*} E[y_i \mid z_i] &= z_iE[y_i \mid z_i = 1] + (1 - z_i)E[y_i \mid z_i = 0] \\ &= E[y_i \mid z_i=0] + \left[E[y_i \mid z_i= 1] - E[y_i \mid z_i=0]\right]z_i \\ &= \gamma_0 + \gamma_1 z_i . \end{align*}$$ The expected value of $y$ for women is $\beta_0$ and also $\gamma_0 + \gamma_1$. For men, it is $\beta_0 + \beta_1$ and $\gamma_0$.

These results show how the coefficients from the two models are related: matching the group means gives $\beta_1 = -\gamma_1$ and $\beta_0 = \gamma_0 + \gamma_1$. A similar exercise with your data should show that the "different" coefficients you get are just sums and differences of one another.
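A quick numerical check of these relations, under the same two-category setup (the simulated data and use of `statsmodels` are my own illustrative assumptions): the two parametrizations give different coefficients but identical fitted group means, up to floating-point error, which is why averaging across regressions should gain nothing.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
male = rng.integers(0, 2, size=100).astype(float)
y = 1.5 + 0.7 * male + rng.normal(size=100)

beta = sm.OLS(y, sm.add_constant(male)).fit().params       # "female" omitted
gamma = sm.OLS(y, sm.add_constant(1 - male)).fit().params  # "male" omitted

print(np.isclose(beta[1], -gamma[1]))            # True: beta_1 = -gamma_1
print(np.isclose(beta[0], gamma[0] + gamma[1]))  # True: same mean for women
print(beta[0] + beta[1], gamma[0])               # same mean for men, twice
```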
