A source of confusion I often come across: people who want to use a categorical variable with more than two categories in a linear regression (simple or multiple) think that they must manually code dummy variables in R, following the n - 1 rule to avoid the dummy variable trap.
However, doesn't R do the dummy encoding under the hood by default when a categorical variable is passed to the lm() function? And isn't this the case whether the variable is a factor or a character vector? The reference category should really be defined explicitly, but by default R uses 0 if the factor has numeric levels and alphabetical ordering (A or closest to A) if the factor has character string levels. If the variable is simply a character vector, R uses the same alphabetically defined reference by default.
Best Answer
Not exactly. Think about it. How would R (or any tool) guess whether the vector c(0,0,1,1,2,2) encodes categories coded as 0,1,2 or a continuous variable, say time in days, in the range from 0 to 2? So when the variable is integer or numeric, R treats it as continuous.
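A minimal sketch of the difference, using made-up data (the data frame `d` and its columns `y` and `group` are assumptions for illustration): the same 0/1/2 codes give one slope when left numeric and two indicator coefficients when wrapped in factor().

```r
# Hypothetical data: `group` holds category codes 0, 1, 2 stored as numbers.
d <- data.frame(
  y     = c(1.0, 1.2, 2.9, 3.1, 3.0, 3.2),
  group = c(0, 0, 1, 1, 2, 2)
)

# Numeric predictor: a single slope, i.e. a linear effect per unit of `group`.
fit_numeric <- lm(y ~ group, data = d)
names(coef(fit_numeric))
# "(Intercept)" "group"

# Factor predictor: one indicator per non-reference level (reference is "0").
fit_factor <- lm(y ~ factor(group), data = d)
names(coef(fit_factor))
# "(Intercept)" "factor(group)1" "factor(group)2"
```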
When the variable is a factor or a character vector, R does dummy encoding "under the hood" as you say. By default (treatment contrasts), it takes the first level as the reference and creates a 0-1 indicator variable for each of the other levels. For a character vector the levels are the sorted unique values, so the default ordering is not always intuitive.
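As a rough illustration of how the reference level is chosen, and how to set it yourself with relevel(), here is a sketch on a made-up character vector (`colour` is a hypothetical name, not from the question):

```r
# Hypothetical character vector of categories.
colour <- c("red", "green", "blue", "green", "red", "blue")

# Converting to a factor sorts the levels; the first one is the reference
# under R's default treatment contrasts.
f <- factor(colour)
levels(f)
# "blue" "green" "red"

# Choose the reference explicitly instead of relying on the sort order.
f2 <- relevel(f, ref = "red")
levels(f2)
# "red" "blue" "green"

# Number-like strings are sorted as text, so "10" comes before "2".
levels(factor(c("2", "10", "1")))
# "1" "10" "2"
```

Setting the reference explicitly keeps the interpretation of the coefficients stable even if the set of categories in the data changes.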
It's always a good idea to double check what R guesses on your behalf. To do this, use the model.matrix() function, which returns the design matrix.
Created on 2022-03-23 by the reprex package (v2.0.1)
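A minimal sketch of this check with model.matrix(), on made-up data (the data frame `d` and its columns `y` and `colour` are assumptions for illustration):

```r
# Hypothetical data with a character-vector predictor.
d <- data.frame(
  y      = c(1.0, 1.2, 2.9, 3.1, 3.0, 3.2),
  colour = c("red", "red", "green", "green", "blue", "blue")
)

# Build the design matrix that lm(y ~ colour, data = d) would use.
X <- model.matrix(y ~ colour, data = d)
colnames(X)
# "(Intercept)" "colourgreen" "colourred"

# Print the matrix: rows for the reference level ("blue") have zeros
# in both indicator columns.
X
```

The missing "colourblue" column shows which level R picked as the reference.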