Linear Regression in R – Using Dummy Variables Without Manual Encoding

categorical-encoding, r, regression

A source of confusion I often come across: people who want to use categorical data with more than two categories in a linear regression (simple or multiple) think they must manually code dummy variables in R to represent the categories, following the n - 1 rule to avoid the dummy variable trap.

However, doesn't R do the dummy encoding under the hood by default when a categorical variable is passed to lm()? And isn't this the case whether the variable is a factor or a character vector? Ideally the reference category should be set explicitly, but by default R seems to use 0 if the factor has numeric levels, and the alphabetically first level (A or closest to A) if the factor has character string levels. If the variable is simply a character vector, R uses the same alphabetically defined reference by default.

Best Answer

Not exactly. Think about it. How would R (or any tool) guess whether the vector c(0,0,1,1,2,2) encodes categories coded as 0,1,2 or a continuous variable, say time in days, in the range from 0 to 2? So when the variable is integer or numeric, R treats it as continuous.
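As a minimal sketch of that difference, the snippet below fits the same toy data twice with lm(): once with x left as numeric and once wrapped in factor(). The data frame d and its y values are made up for illustration and are not from the question.

```r
# Hypothetical toy data: x is coded 0/1/2, y is an arbitrary response.
d <- data.frame(
  x = c(0, 0, 1, 1, 2, 2),
  y = c(1.0, 1.2, 3.1, 2.9, 2.0, 2.2)
)

# x left as numeric: R estimates a single slope for x.
fit_numeric <- lm(y ~ x, data = d)
names(coef(fit_numeric))
#> [1] "(Intercept)" "x"

# x wrapped in factor(): one indicator per non-reference level.
fit_factor <- lm(y ~ factor(x), data = d)
names(coef(fit_factor))
#> [1] "(Intercept)" "factor(x)1"  "factor(x)2"
```

So if the numbers are really category codes, the conversion to a factor has to be done explicitly; R will not infer it.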

When the variable is a factor or a character vector, R does dummy encoding "under the hood", as you say. By default (treatment contrasts), it takes the first level as the reference and creates a 0-1 indicator variable for each of the other levels. The default ordering of the levels is not always intuitive: a character vector is sorted as strings, so "10" comes before "5", whereas a factor built from numbers keeps the numeric order.

It's always a good idea to double-check what R guesses on your behalf. To do this, use the model.matrix function, which returns the design matrix.

library("tidyverse")

X <- tibble(x = c(0, 0, 5, 5, 10, 10))

model.matrix(~x, data = X %>% mutate(x = as.integer(x)))
#>   (Intercept)  x
#> 1           1  0
#> 2           1  0
#> 3           1  5
#> 4           1  5
#> 5           1 10
#> 6           1 10
#> attr(,"assign")
#> [1] 0 1
model.matrix(~x, data = X %>% mutate(x = as.numeric(x)))
#>   (Intercept)  x
#> 1           1  0
#> 2           1  0
#> 3           1  5
#> 4           1  5
#> 5           1 10
#> 6           1 10
#> attr(,"assign")
#> [1] 0 1
# Character vector: the levels are sorted as strings, so "10" comes before "5"
model.matrix(~x, data = X %>% mutate(x = as.character(x)))
#>   (Intercept) x10 x5
#> 1           1   0  0
#> 2           1   0  0
#> 3           1   0  1
#> 4           1   0  1
#> 5           1   1  0
#> 6           1   1  0
#> attr(,"assign")
#> [1] 0 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$x
#> [1] "contr.treatment"
model.matrix(~x, data = X %>% mutate(x = as.factor(x)))
#>   (Intercept) x5 x10
#> 1           1  0   0
#> 2           1  0   0
#> 3           1  1   0
#> 4           1  1   0
#> 5           1  0   1
#> 6           1  0   1
#> attr(,"assign")
#> [1] 0 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$x
#> [1] "contr.treatment"

Created on 2022-03-23 by the reprex package (v2.0.1)
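If the default reference is not the one you want, it can be set explicitly. The following is a small base R sketch, not part of the reprex above; the data frame d and the column name grp are made up for illustration. It uses factor(levels = ...) to fix the level order and relevel() to pick a different reference.

```r
d <- data.frame(grp = c("0", "0", "5", "5", "10", "10"))

# Spell out the level order; the first level ("0") becomes the reference.
d$grp <- factor(d$grp, levels = c("0", "5", "10"))
cols_default <- colnames(model.matrix(~grp, data = d))
cols_default
#> [1] "(Intercept)" "grp5"        "grp10"

# relevel() moves the chosen level to the front, making "10" the reference.
d$grp <- relevel(d$grp, ref = "10")
cols_releveled <- colnames(model.matrix(~grp, data = d))
cols_releveled
#> [1] "(Intercept)" "grp0"        "grp5"
```

The level that has no column of its own in the design matrix is the reference.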

Related Solutions

Solved – How to encode factors as dummy variables when using stepPlr

See the first example given in the help for step.plr:

library(stepPlr)

n <- 100
p <- 3
z <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x <- data.frame(x1=factor(z[ ,1]),x2=factor(z[ ,2]),x3=factor(z[ ,3]))
y <- sample(c(0,1),n,replace=TRUE)
fit <- step.plr(x,y)
# 'level' is automatically generated. Check 'fit$level'.

Does that answer your question?

Solved – Regression using dummy variables

I am not a big fan of converting a continuous variable to multiple dummy variables. I guess the binning procedure is considered standard practice in score card development.

Regarding dummy variable insignificance: when you add dummy variables to a regression, the omitted group acts as the reference group, and each dummy's coefficient compares its group to that reference. When a variable has a nonlinear (e.g. quadratic) relationship with the log odds, some dummy variables may be insignificant (those for groups whose effect is close to that of the reference group). My suggestion is to look at the pattern of log odds in each bin before merging. Depending on the pattern, you can either make fewer final bins or change the reference group. I know this is a bit abstract, but I cannot be more specific without knowing the case.
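One way to look at the log odds per bin is sketched below in base R. Everything here is assumed for illustration: the data are simulated with a quadratic effect on the log odds, and the names score, bin, and the break points are hypothetical rather than taken from any real scorecard.

```r
set.seed(1)

# Simulated, hypothetical data: the log odds are quadratic in `score`.
n <- 500
score <- runif(n, 0, 100)
y <- rbinom(n, 1, plogis(-2 + 0.0008 * (score - 50)^2))

# Bin the continuous predictor into four equal-width groups.
bin <- cut(score, breaks = c(0, 25, 50, 75, 100), include.lowest = TRUE)

# Empirical log odds of y = 1 within each bin.
log_odds <- qlogis(tapply(y, bin, mean))
log_odds

# Refit with a different reference bin (here the second one) to change
# what each dummy coefficient is compared against.
fit <- glm(y ~ relevel(bin, ref = levels(bin)[2]), family = binomial)
summary(fit)$coefficients
```

Bins with similar empirical log odds are candidates for merging, and a bin at one extreme of the pattern usually makes a more informative reference than one in the middle.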

You could also drop the insignificant variable. Doing so merges the group associated with the dropped dummy into the reference group. This may not be appropriate if merging the reference group with the insignificant dummy's group doesn't make business sense.
