Interpretation of dummies (several variables and categories!)

categorical-encoding, interpretation, multiple-regression

I am trying to figure out how to interpret the coefficients when using several qualitative independent variables, some of which have more than two categories. I found this short article that touches upon that issue (the one-page preview is enough to understand what I mean, but I will explain here too). The article says that the interpretation is messed up in this case. Let me use the example variables from the article to make clear what I mean. Wage is my dependent variable. The independent variables are all dummies that can be grouped into:

educational attainment

E1 = postgraduate (value 1 if postgraduate, 0 otherwise)

E2 = bachelor (value 1 if bachelor's degree, 0 otherwise)

E3 = high school (value 1 if high school, 0 otherwise)

marital status

R1 = married/in relationship (value 1 if married, 0 otherwise)

R2 = divorced (value 1 if divorced, 0 otherwise)

R3 = single (value 1 if single, 0 otherwise)

sex

M = male (value 1 if male, 0 otherwise)

F = female (value 1 if female, 0 otherwise)

Now, to avoid the dummy variable trap I drop E3, R3 and F. My regression equation would then look like this:

wage = b0 + b1*E1 + b2*E2 + b3*R1 + b4*R2 + b5*M

The article states that the baseline to compare against would thus be a single woman whose highest qualification is high school, and that this would mess up the interpretation of the individual dummy coefficients. My questions:

  1. Do I understand it correctly that it is indeed using several qualitative variables that have more than 2 categories (in this case educational attainment and marital status have each 3) that messes up the interpretation? I get that if I for example only used the marital status dummies (again excluding R3 to not face the dummy trap), I could easily state that married people (R1) earn more or less compared to single people (R3). But is this no longer possible as soon as I include the educational attainment dummies in my regression as well?
  2. Are dummy variables that only have 2 categories and therefore are represented in the regression equation with 1 dummy only (e.g. my sex variable -> M in the equation) affected by this at all or could I still state that men earn more or less compared to females, despite having educational attainment and marital status in my equation?

Best Answer

When you have more than one independent variable, continuous or categorical and however many categories, all the parameter estimates are controlling for the other variables. Another way of saying "controlling" is "holding constant".
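As a worked illustration of "holding constant" for the equation in the question, compare two people who share the same values of R1, R2 and M and differ only in education, one postgraduate and one high school (the omitted category). This is a sketch that assumes the model exactly as written in the question, with no interaction terms:

$$
\begin{aligned}
E[\text{wage} \mid E_1 = 1, E_2 = 0, R_1, R_2, M] &= b_0 + b_1 + b_3 R_1 + b_4 R_2 + b_5 M \\
E[\text{wage} \mid E_1 = 0, E_2 = 0, R_1, R_2, M] &= b_0 + b_3 R_1 + b_4 R_2 + b_5 M \\
\text{difference} &= b_1
\end{aligned}
$$

The terms in R1, R2 and M cancel, so b1 is the postgraduate versus high school difference whatever the marital status or sex. Only the intercept b0 refers to the full baseline person (single, female, high school).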

So, if (e.g.) the parameter estimate for E1 is positive and large, you can say:

"Other things being equal, people with postgraduate degrees earn more than people whose highest qualification is high school (the omitted education category)"

If the parameter estimate for "M" is positive and large, you can similarly say: "Holding other variables constant, men earn more than women".
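The statements above can be checked numerically. The following is a minimal sketch, not the article's or the asker's analysis: the data are simulated, the "true" coefficients are made up, and the helper `predict` is a hypothetical name introduced here. It fits the question's equation by least squares with NumPy and shows that the fitted male/female gap equals the coefficient on M for every combination of education and marital status.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated category codes (assumed data, purely for illustration).
edu = rng.integers(0, 3, n)   # 0 = postgraduate, 1 = bachelor, 2 = high school
rel = rng.integers(0, 3, n)   # 0 = married, 1 = divorced, 2 = single
male = rng.integers(0, 2, n)  # 1 = male, 0 = female

# Dummies, with E3, R3 and F dropped as in the question.
E1, E2 = (edu == 0).astype(float), (edu == 1).astype(float)
R1, R2 = (rel == 0).astype(float), (rel == 1).astype(float)
M = male.astype(float)

# Made-up coefficients b0..b5, used only to generate the wages.
true_b = np.array([30.0, 25.0, 10.0, 4.0, -2.0, 6.0])
X = np.column_stack([np.ones(n), E1, E2, R1, R2, M])
wage = X @ true_b + rng.normal(0.0, 5.0, n)

# Ordinary least squares fit of wage = b0 + b1*E1 + b2*E2 + b3*R1 + b4*R2 + b5*M.
b, *_ = np.linalg.lstsq(X, wage, rcond=None)

def predict(e1, e2, r1, r2, m):
    """Fitted wage for one combination of dummy values."""
    return b @ np.array([1.0, e1, e2, r1, r2, m])

# Male minus female fitted wage within each education x marital status cell.
gaps = [
    predict(e1, e2, r1, r2, 1) - predict(e1, e2, r1, r2, 0)
    for (e1, e2) in [(1, 0), (0, 1), (0, 0)]
    for (r1, r2) in [(1, 0), (0, 1), (0, 0)]
]

print("coefficient on M:", b[5])
print("gaps by cell:    ", np.round(gaps, 6))
print("baseline (single, female, high school):", predict(0, 0, 0, 0, 0))
```

All nine gaps coincide with the coefficient on M because the model has no interaction terms; the choice of omitted categories only changes what the intercept refers to.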

One note, although you didn't ask about it: models for wage or income often use log(wage). There are two reasons: 1) models of wage often have non-normal residuals, and 2) changes in wage are often better conceptualized as multiplicative rather than additive. E.g., going from a wage of \$10,000/year to \$20,000/year is not like going from \$100,000/year to \$110,000/year; it's more like going from \$100,000/year to \$200,000/year.
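The multiplicative point can be made concrete with a short sketch. The wage figures are the ones from the example above; the helper `pct_effect` is a hypothetical name, and it uses the usual reading that a dummy coefficient b in a log(wage) model corresponds to a proportional change of exp(b) - 1.

```python
import math

# Differences on the log scale for the three wage changes in the example.
doubling_low = math.log(20_000) - math.log(10_000)     # 10k -> 20k
doubling_high = math.log(200_000) - math.log(100_000)  # 100k -> 200k
small_raise = math.log(110_000) - math.log(100_000)    # 100k -> 110k

print(doubling_low, doubling_high, small_raise)

def pct_effect(b):
    """Percentage change in wage implied by a dummy coefficient b
    when the dependent variable is log(wage)."""
    return 100.0 * (math.exp(b) - 1.0)

# A coefficient equal to log(2) on a dummy corresponds to a doubling of wage.
print(pct_effect(math.log(2)))
```

On the log scale the two doublings are the same size, while the \$10,000 raise from \$100,000 is much smaller, which is the sense in which log(wage) treats changes as multiplicative.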