Linear Regression in R – Using Dummy Variables Without Manual Encoding

categorical-encoding r regression

A source of confusion I often come across: people who want to use a categorical variable with more than two categories in a linear regression (simple or multiple) think they must manually code dummy variables in R for the categories, following the n – 1 rule to avoid the dummy variable trap.

However, doesn't R do the dummy encoding under the hood by default when a categorical variable is passed to lm()? And isn't this the case whether the variable is a factor or a character vector? Ideally the reference category would be set explicitly, but as I understand it R uses 0 as the reference if the factor has numeric levels, and the alphabetically first level (A or closest to A) if the factor has character string levels. If the variable is simply a character vector, R uses the same alphabetically defined reference by default.

Best Answer

Not exactly. Think about it. How would R (or any tool) guess whether the vector c(0,0,1,1,2,2) encodes categories coded as 0,1,2 or a continuous variable, say time in days, in the range from 0 to 2? So when the variable is integer or numeric, R treats it as continuous.
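
A minimal sketch of that distinction, using made-up data (the data frame `d` and its `y` values are invented for illustration). The same 0/1/2 coding gives one slope when left numeric and two indicator coefficients when wrapped in factor():

```r
# Hypothetical toy data: x is coded 0/1/2, y is invented for illustration
d <- data.frame(
  x = c(0, 0, 1, 1, 2, 2),
  y = c(1.0, 1.2, 2.1, 1.9, 2.4, 2.6)
)

# Numeric x: treated as continuous, so there is a single slope for x
fit_numeric <- lm(y ~ x, data = d)
names(coef(fit_numeric))
#> [1] "(Intercept)" "x"

# factor(x): one indicator coefficient per non-reference level
fit_factor <- lm(y ~ factor(x), data = d)
names(coef(fit_factor))
#> [1] "(Intercept)" "factor(x)1"  "factor(x)2"
```

So if a numeric column really holds category codes, converting it with factor() before modelling is what tells R to dummy-encode it.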

When the variable is a factor or a character vector, R does dummy encoding "under the hood" as you say. By default, it uses treatment contrasts: the first level is the reference, and R creates a 0-1 indicator variable for each of the other levels. The default ordering of the levels is not always intuitive. For a character vector the levels are sorted as strings, so "10" comes before "5"; for a factor the existing level order is used.
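
If the default reference is not the one you want, you can set it yourself. The sketch below uses base R's factor() and relevel(); the vector `x` and the object names are made up for illustration:

```r
# Hypothetical character vector of category labels
x <- c("0", "0", "5", "5", "10", "10")

# Default: levels are sorted as strings, so "10" comes before "5"
levels(factor(x))
#> [1] "0"  "10" "5"

# Explicit level order: the first level listed becomes the reference
f <- factor(x, levels = c("0", "5", "10"))

# relevel() moves one chosen level to the front, making it the reference
f5 <- relevel(f, ref = "5")
levels(f5)
#> [1] "5"  "0"  "10"

# The reference level "5" has no column of its own in the design matrix
mm <- model.matrix(~ x, data = data.frame(x = f5))
colnames(mm)
#> [1] "(Intercept)" "x0"          "x10"
```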

It's always a good idea to double-check what R guesses on your behalf. To do this, use the model.matrix function, which returns the design matrix.

library("tidyverse")

X <- tibble(x = c(0, 0, 5, 5, 10, 10))

# Integer x: treated as continuous, so the design matrix has a single x column
model.matrix(~x, data = X %>% mutate(x = as.integer(x)))
#>   (Intercept)  x
#> 1           1  0
#> 2           1  0
#> 3           1  5
#> 4           1  5
#> 5           1 10
#> 6           1 10
#> attr(,"assign")
#> [1] 0 1
# Numeric (double) x: same result as the integer case
model.matrix(~x, data = X %>% mutate(x = as.numeric(x)))
#>   (Intercept)  x
#> 1           1  0
#> 2           1  0
#> 3           1  5
#> 4           1  5
#> 5           1 10
#> 6           1 10
#> attr(,"assign")
#> [1] 0 1
# Character x: dummy encoded; the levels are ordered alphabetically, so "10" comes before "5"
model.matrix(~x, data = X %>% mutate(x = as.character(x)))
#>   (Intercept) x10 x5
#> 1           1   0  0
#> 2           1   0  0
#> 3           1   0  1
#> 4           1   0  1
#> 5           1   1  0
#> 6           1   1  0
#> attr(,"assign")
#> [1] 0 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$x
#> [1] "contr.treatment"
# Factor created from numbers: dummy encoded; the levels keep numeric order (0, 5, 10)
model.matrix(~x, data = X %>% mutate(x = as.factor(x)))
#>   (Intercept) x5 x10
#> 1           1  0   0
#> 2           1  0   0
#> 3           1  1   0
#> 4           1  1   0
#> 5           1  0   1
#> 6           1  0   1
#> attr(,"assign")
#> [1] 0 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$x
#> [1] "contr.treatment"

Created on 2022-03-23 by the reprex package (v2.0.1)
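
To see what the n – 1 indicator columns mean in a fitted model, here is a small sketch with invented data (the names `d`, `g`, and `y` are hypothetical). With the default treatment contrasts and a single categorical predictor, the intercept is the mean of the reference level, and each remaining coefficient is that level's mean minus the reference mean:

```r
# Hypothetical data: three groups with means 2, 5, and 11
d <- data.frame(
  g = c("a", "a", "b", "b", "c", "c"),
  y = c(1, 3, 4, 6, 10, 12)
)

# g is a character vector, so lm() dummy-encodes it with "a" as the reference
fit <- lm(y ~ g, data = d)

# Group means, for comparison with the coefficients
group_means <- tapply(d$y, d$g, mean)

# Intercept = mean of "a"; gb and gc = differences from the mean of "a"
coef(fit)
group_means - group_means[["a"]]
```

No manual dummy columns were needed: lm() built the two indicators for "b" and "c" itself and absorbed "a" into the intercept.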