Solved – Dummy variable trap issues


I am running a large OLS regression where all the independent variables (around 400) are dummy variables. If all are included, there is perfect multicollinearity (the dummy variable trap), so I have to omit one of the variables before running the regression.
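As a minimal sketch of the trap itself, assuming a single three-level categorical predictor (the male/female/unknown example from the question, with simulated labels): once an intercept is included, the dummy columns sum to the intercept column, so the full design matrix is rank deficient, and dropping any one level restores full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
levels = ["male", "female", "unknown"]
labels = rng.choice(levels, size=100, p=[0.45, 0.45, 0.10])

# One 0/1 column per level, plus an intercept column of ones.
dummies = np.column_stack([(labels == lev).astype(float) for lev in levels])
X_full = np.column_stack([np.ones(len(labels)), dummies])

# The three dummy columns sum to the intercept column, so the 4-column
# matrix has rank 3.  Dropping one level ("unknown" here) gives full rank.
print(np.linalg.matrix_rank(X_full))          # 3, not 4: perfectly collinear
print(np.linalg.matrix_rank(X_full[:, :-1]))  # 3 columns, rank 3: full rank
```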

My first question is, which variable should be omitted? I have read that it is better to omit a variable that is present in many of the observations rather than one that is present in only a few (e.g. if almost all observations are "male" or "female" and just a few are "unknown", omit either "male" or "female"). Is this justified?

After running the regression with a variable omitted, I can estimate the coefficient of the omitted variable because I impose the constraint that the coefficients of all the dummy variables should average to 0. I use this fact to shift the coefficient values of all the included variables and obtain an estimate for the omitted one. My next question is whether there is a similar technique for estimating the standard error of the omitted variable's coefficient. As it stands, I have to re-run the regression omitting a different variable (and including the variable I had omitted in the first regression) in order to obtain a standard error estimate for the coefficient of the originally omitted variable.
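A hedged sketch of that re-centering step (the `statsmodels` library and the toy male/female/unknown setup are illustrative assumptions, not from the question): under the sum-to-zero convention with $J$ levels, the omitted effect equals $-\tfrac{1}{J}\sum_{j \neq k} \hat\beta_j$, a linear combination of the fitted coefficients, so one standard way to get its standard error without re-running the regression is a linear-restriction test against the coefficient covariance matrix.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
levels = ["male", "female", "unknown"]
labels = rng.choice(levels, size=200, p=[0.45, 0.45, 0.10])
group_means = {"male": 1.0, "female": 2.0, "unknown": 3.0}  # hypothetical truth
y = np.array([group_means[l] for l in labels]) + rng.normal(size=200)

# Design: intercept plus dummies for "male" and "female"; "unknown" omitted.
X = np.column_stack([np.ones(200),
                     (labels == "male").astype(float),
                     (labels == "female").astype(float)])
res = sm.OLS(y, X).fit()
b0, b_male, b_female = res.params
J = len(levels)

# Re-center so the three level effects average to zero.
shift = (b_male + b_female) / J
effects = {"male": b_male - shift,
           "female": b_female - shift,
           "unknown": -shift}  # recovered estimate for the omitted level
print(effects)

# The omitted effect is -(b_male + b_female)/J, i.e. the linear combination
# w'beta with w = [0, -1/J, -1/J]; t_test reports it with a standard error.
w = np.array([0.0, -1.0 / J, -1.0 / J])
print(res.t_test(w))
```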

Finally, I notice that the coefficient estimates I get (after re-centering around zero) vary slightly depending on which variable is omitted. In theory, would it be better to run several regressions, each omitting a different variable, and then average the coefficient estimates from all the regressions?

Best Answer

You should get the "same" estimates no matter which variable you omit; the coefficients may be different, but the estimates of particular quantities or expectations should be the same across all the models.

In a simple case, let $x_i=1$ for men and 0 for women. Then, we have the model: $$\begin{align*} E[y_i \mid x_i] &= x_iE[y_i \mid x_i = 1] + (1 - x_i)E[y_i \mid x_i = 0] \\ &= E[y_i \mid x_i=0] + \left[E[y_i \mid x_i= 1] - E[y_i \mid x_i=0]\right]x_i \\ &= \beta_0 + \beta_1 x_i. \end{align*}$$ Now, let $z_i=1$ for women and 0 for men. Then $$\begin{align*} E[y_i \mid z_i] &= z_iE[y_i \mid z_i = 1] + (1 - z_i)E[y_i \mid z_i = 0] \\ &= E[y_i \mid z_i=0] + \left[E[y_i \mid z_i= 1] - E[y_i \mid z_i=0]\right]z_i \\ &= \gamma_0 + \gamma_1 z_i . \end{align*}$$ The expected value of $y$ for women is $\beta_0$ and also $\gamma_0 + \gamma_1$. For men, it is $\beta_0 + \beta_1$ and $\gamma_0$.

These results show how the coefficients from the two models are related: matching the group means gives $\beta_1 = -\gamma_1$ and $\beta_0 = \gamma_0 + \gamma_1$. A similar exercise with your data should show that the "different" coefficients you get are just sums and differences of one another.
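A quick numerical check of these relations, under the same two-category setup (the simulated data and use of `statsmodels` are my own illustrative assumptions): the two parametrizations give different coefficients but identical fitted group means, up to floating-point error, which is why averaging across regressions should gain nothing.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
male = rng.integers(0, 2, size=100).astype(float)
y = 1.5 + 0.7 * male + rng.normal(size=100)

beta = sm.OLS(y, sm.add_constant(male)).fit().params       # "female" omitted
gamma = sm.OLS(y, sm.add_constant(1 - male)).fit().params  # "male" omitted

print(np.isclose(beta[1], -gamma[1]))            # True: beta_1 = -gamma_1
print(np.isclose(beta[0], gamma[0] + gamma[1]))  # True: same mean for women
print(beta[0] + beta[1], gamma[0])               # same mean for men, twice
```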
