Partialling or regressing out a categorical variable

Occasionally I see in the literature that a categorical variable such as sex is “partialled” or “regressed” out in a (fixed-effects or mixed-effects) regression analysis. I'm troubled by the following practical issues raised by such a statement:

(1) The coding scheme is usually not mentioned in the paper. Such a variable has to be coded with numeric values, and I feel the sensible choice is effect coding (e.g., male = 1, female = -1), so that the other effects are interpreted at the grand mean of the two sex groups. A different coding may lead to a different (and unwanted) interpretation. For example, dummy coding (e.g., male = 0, female = 1) would leave the other effects tied to males rather than to the grand mean. Even centering the dummy-coded variable might not serve the partialling purpose well if the two groups contain unequal numbers of subjects. Am I correct?

(2) If such a categorical variable is included in the model, examining its effect first seems necessary, and the effect should be discussed because of its consequences for the interpretation of the other effects. What troubles me is that sometimes the authors don't even mention the significance of the sex effect, let alone any model-building process. If a sex effect exists, a natural follow-up question is whether sex interacts with any other variables in the model. If there is neither a sex effect nor any interaction, sex should be removed from the model.

(3) If those authors consider sex to be of no interest, what is the point of including it in the model in the first place without checking its effects? Does including such a categorical variable (at the cost of one degree of freedom for the fixed effect of sex) gain anything for their partialling purpose when a sex effect exists (my limited experience says essentially no)?

Best Answer

I don't think (1) makes any difference. The idea is to partial the effect of Sex out of the response and the other predictors. It doesn't matter whether you code 0, 1 (treatment contrasts) or 1, -1 (sum-to-zero contrasts): together with the intercept, both codings span the same column space, so the models represent the same "amount" of information, which is then removed. Here is an example in R:

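set.seed(1)
# Simulated sizes: 20 males (mean 180) followed by 20 females (mean 170)
dat <- data.frame(Size = c(rnorm(20, 180, sd = 5),
                           rnorm(20, 170, sd = 5)),
                  Sex = gl(2, 20, labels = c("Male", "Female")))

# Fit with treatment (0/1) contrasts and keep the residuals
options(contrasts = c("contr.treatment", "contr.poly"))
r1 <- resid(m1 <- lm(Size ~ Sex, data = dat))
# Refit with sum-to-zero (1/-1) contrasts
options(contrasts = c("contr.sum", "contr.poly"))
r2 <- resid(m2 <- lm(Size ~ Sex, data = dat))
# Restore R's default contrasts
options(contrasts = c("contr.treatment", "contr.poly"))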

The residuals from these two models are identical, and it is this information that one would carry into the subsequent model (after likewise removing the Sex effect from the other covariates):

> all.equal(r1, r2)
[1] TRUE
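
To make the "subsequent model" step concrete, here is a minimal sketch of the full partialling procedure. It assumes a hypothetical covariate `Age` and freshly simulated data (neither is part of the example above), and the helper `partial_slope()` is only an illustrative name. Sex is regressed out of both the response and the covariate, and the slope from the residual-on-residual regression is compared with the `Age` slope from the model that includes Sex directly.

```r
set.seed(1)
n <- 40
dat2 <- data.frame(Sex = gl(2, 20, labels = c("Male", "Female")))
# Hypothetical covariate and response, both shifted by Sex (assumed values)
dat2$Age  <- rnorm(n, mean = 40, sd = 10) + 5 * (dat2$Sex == "Female")
dat2$Size <- 180 - 10 * (dat2$Sex == "Female") + 0.3 * dat2$Age +
  rnorm(n, sd = 5)

# Full model: Age slope adjusted for Sex
b_full <- unname(coef(lm(Size ~ Sex + Age, data = dat2))["Age"])

# Partial Sex out of the response and the covariate under a given coding,
# then regress the residuals on each other
partial_slope <- function(contr) {
  ctr <- list(Sex = contr)
  ry <- resid(lm(Size ~ Sex, data = dat2, contrasts = ctr))
  rx <- resid(lm(Age ~ Sex, data = dat2, contrasts = ctr))
  unname(coef(lm(ry ~ rx))["rx"])
}

b_trt <- partial_slope("contr.treatment")
b_sum <- partial_slope("contr.sum")

all.equal(b_full, b_trt)  # partialled slope matches the full-model slope
all.equal(b_trt, b_sum)   # and does not depend on the coding of Sex
```

Both comparisons should return `TRUE`: in this additive setting the coding of Sex changes the intercept and the Sex coefficient, but not what is left over once Sex has been removed.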

I happen to agree with (2). On (3), even if Sex is of no interest to the researchers, they might still want to control for Sex effects, so my null model would be one that includes Sex, and I would test alternatives that add further covariates to Sex. Your point about interactions and about testing the effects of the non-interesting variables is important and valid.
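
As a rough sketch of that workflow, assuming a hypothetical covariate of interest called `Age` and simulated data (both are illustrative, not from the question), one could compare a Sex-only null model against models that add the covariate and then its interaction with Sex:

```r
set.seed(2)
n <- 40
dat3 <- data.frame(Sex = gl(2, 20, labels = c("Male", "Female")))
dat3$Age  <- rnorm(n, mean = 40, sd = 10)  # hypothetical covariate of interest
dat3$Size <- 180 - 10 * (dat3$Sex == "Female") + 0.3 * dat3$Age +
  rnorm(n, sd = 5)

m0 <- lm(Size ~ Sex, data = dat3)        # null model: controls for Sex only
m1 <- lm(Size ~ Sex + Age, data = dat3)  # adds the covariate of interest
m2 <- lm(Size ~ Sex * Age, data = dat3)  # lets the Age slope differ by Sex

# Sequential F tests: m0 vs m1 (Age effect), m1 vs m2 (Sex-by-Age interaction)
cmp <- anova(m0, m1, m2)
print(cmp)
```

The second row of the table tests the covariate given Sex, and the third row addresses the interaction question raised in (2). Whether a non-significant interaction should then be dropped is a modelling decision that this sketch does not settle.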
