Regression – Impact of Omitted Dummy Variable Coefficients in OLS

categorical data, regression

I am running an OLS regression using dummy variables built from categorical variables. Say race became race1, race2 and race3. I omit race1 to avoid the dummy variable trap, run OLS, and obtain coefficients for race2 and race3.

What would the coefficient for race1 be in the regression equation? Or should I simply leave the omitted dummy variable out of the final equation when computing the predicted value?

Best Answer

How to use and interpret the coefficients from a regression model with categorical variables to get predicted values depends on how your variables are coded. There are many different coding schemes (see here for a good overview). It sounds like you used 'reference cell coding', which most people call 'dummy coding'. I gather your race1 category is the reference category. In this case there is no separate coefficient for race1: it is implicitly 0, and the coefficients for race2 and race3 are differences from the race1 group. If race is the only predictor, the intercept is the mean of the race1 group. To compute the predicted value for race1, you would solve the equation using whatever values for other variables apply and omitting the coefficients for the other categories (i.e., race2 & race3). To predict for race2 or race3, you would add that category's coefficient. There is some good, relevant info here, and here.
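As a minimal sketch of the single-factor case, the snippet below fits OLS with NumPy on a small made-up data set (the values and variable names are purely illustrative, not from the question) and shows that the intercept recovers the race1 mean while the race2 and race3 coefficients are offsets from it.

```python
import numpy as np

# Hypothetical toy data: an outcome y observed in three race groups.
race = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
y = np.array([10.0, 12.0, 11.0, 14.0, 15.0, 16.0, 7.0, 8.0, 9.0])

# Reference cell (dummy) coding: race1 is the omitted reference category.
race2 = (race == 2).astype(float)
race3 = (race == 3).astype(float)
X = np.column_stack([np.ones_like(y), race2, race3])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b_race2, b_race3 = beta

# Predictions: race1 uses the intercept alone (its coefficient is implicitly 0).
pred_race1 = intercept
pred_race2 = intercept + b_race2
pred_race3 = intercept + b_race3

print(pred_race1, y[race == 1].mean())
print(pred_race2, y[race == 2].mean())
print(pred_race3, y[race == 3].mean())
```

Each printed pair should agree up to floating-point error, since with a single dummy-coded factor the fitted values are just the group means.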

edit: The way the question is phrased made me think about situations in which there is only one factor in the model. However, @Michelle raises the question of the more general case. To keep this relatively simple, imagine a case with just two factors, e.g. race and sex, plus some continuous covariates. Using reference cell coding, we will create a dummy for male, so that female is the reference level. Now, solving the regression equation without including any of the factor coefficients (i.e., just the intercept + continuous covariates) yields the predicted mean of the reference cell, which in this case is the race1 female group. Should you want to know the value for race1 males, you would solve as above, but also include the coefficient for male. If you wanted to ignore sex, or make a prediction for a mixed-sex group, you would calculate a weighted average of the above two predictions, weighted by the proportion of each sex in the group. Obviously, this will get more complicated as the number of factors, $J$, increases, but the pattern should be clear enough.
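The two-factor case can be sketched the same way. The example below is an illustration under stated assumptions: the data are generated without noise from made-up coefficients, `age` stands in for a continuous covariate, `predict` is a hypothetical helper, and the 60% male share used for the mixed-sex prediction is arbitrary.

```python
import itertools
import numpy as np

# Hypothetical balanced design: every race x sex x age combination once.
rows = list(itertools.product([1, 2, 3], [0, 1], [20.0, 30.0, 40.0]))
race = np.array([r for r, _, _ in rows])
male = np.array([m for _, m, _ in rows], dtype=float)
age = np.array([a for _, _, a in rows])

# Made-up, noise-free outcome so the fit recovers the coefficients exactly.
y = 5.0 + 2.0 * (race == 2) - 1.0 * (race == 3) + 3.0 * male + 0.5 * age

# Reference cell coding: race1 and female are the omitted reference levels.
X = np.column_stack(
    [np.ones(len(y)), race == 2, race == 3, male, age]
).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b_race2, b_race3, b_male, b_age = beta


def predict(race_level, is_male, age_value):
    """Predicted value; reference levels contribute no coefficient."""
    return (
        b0
        + b_race2 * (race_level == 2)
        + b_race3 * (race_level == 3)
        + b_male * is_male
        + b_age * age_value
    )


# Reference cell: race1 females (intercept + continuous covariate only).
race1_female = predict(1, 0, 30.0)
# race1 males: same equation plus the coefficient for male.
race1_male = predict(1, 1, 30.0)
# Mixed-sex race1 group, assuming 60% male: a weighted average of the two.
share_male = 0.6
race1_mixed = (1 - share_male) * race1_female + share_male * race1_male

print(race1_female, race1_male, race1_mixed)
```

Because the model is linear, the weighted average is the same as plugging the male share directly into the equation in place of the 0/1 dummy.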
