Solved – How to choose number of dummy variables when encoding several categorical variables

categorical data, categorical-encoding, logistic

I'm building a logistic regression, and two of my variables are categorical with three levels each. (Say one variable is male, female, or unknown, and the other is single, married, or unknown.)

How many dummy variables am I supposed to make? Do I make 4 in total (2 for each of the categorical variables, e.g., a male variable, a female variable, a single variable, and a married variable) or 5 in total (2 for one of the categorical variables, 3 for the other)?

I know most textbooks say that when you're dummy encoding a categorical variable with k levels, you should only make k-1 dummy variables, since otherwise you'll get a collinearity with the constant. But what do you do when you're dummy encoding several categorical variables? By the collinearity argument, it sounds like I'd only make k-1 dummy variables for one of the categorical variables, and for the rest of the categorical variables I'd build all k dummy variables.

Best Answer

You would make k-1 dummy variables for each of your categorical variables, which in your example gives 2 + 2 = 4 dummies in total. The textbook argument holds for every variable separately: the k dummies of any one variable sum to 1 in every row, so including all k of them for even one variable makes them collinear with the constant. You can think of the k-1 dummies as contrasts between the effect of each corresponding level and the effect of the level whose dummy is left out.
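To check the collinearity argument numerically, here is a minimal sketch in plain Python. The data are hypothetical (one row for every combination of the two variables from the question), and the helper names `dummies`, `rank`, and `design` are made up for illustration. It compares the rank of the design matrix under the "4 dummies" and "5 dummies" schemes.

```python
from fractions import Fraction
from itertools import product


def dummies(values, levels):
    """Return one 0/1 indicator per level in `levels`, for each value."""
    return [[1 if v == lev else 0 for lev in levels] for v in values]


def rank(rows):
    """Rank of a matrix, computed by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        # Find a row at or below r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


# Hypothetical data: every combination of the two three-level variables.
rows = list(product(["male", "female", "unknown"],
                    ["single", "married", "unknown"]))
sex = [s for s, _ in rows]
status = [m for _, m in rows]


def design(sex_levels, status_levels):
    """Design matrix: a constant column followed by the requested dummies."""
    sex_cols = dummies(sex, sex_levels)
    status_cols = dummies(status, status_levels)
    return [[1] + a + b for a, b in zip(sex_cols, status_cols)]


# k-1 dummies for each variable: constant + 2 + 2 columns.
four = design(["male", "female"], ["single", "married"])
# k-1 dummies for one variable, all k for the other: constant + 2 + 3 columns.
five = design(["male", "female"], ["single", "married", "unknown"])

rank_four = rank(four)
rank_five = rank(five)
print(len(four[0]), rank_four)
print(len(five[0]), rank_five)
```

With 4 dummies the matrix has 5 columns and rank 5, so every coefficient is identified. With 5 dummies it has 6 columns but the rank is still 5, because the single, married, and unknown columns add up to the constant column.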