Solved – Coding categorical variables for linear regression and random forest, factors/characters


I am new to data science, so please do not judge me for these questions.

While fitting a regression model (a linear model, lm_model) with numeric and categorical variables, I noticed that the coefficient estimates in summary(lm_model) are the same whether the categorical variables are stored as factors or as characters.

  1. So, which is preferred for modeling (regression or classification)? From what I have read, factors are preferred, but in my case I do not see any difference.

  2. Some people say that if a categorical variable has many levels (more than 32), it is not possible to use a linear model or a random forest. What are the solutions to this problem, other than grouping the levels into fewer categories? Is it OK to convert factors into characters?

  3. I am also a bit stuck on this question: which algorithms require categorical variables to be transformed into numeric (or dummy) variables? Or is it not necessary? As far as I know, algorithms such as linear regression and logistic regression require it, whereas tree-based algorithms such as random forest do not. But using the caret package in R, I noticed that a linear model with categorical variables stored as factors, or even as characters, runs fine and I can see the coefficient estimates. What is the preferred method in this case?

  4. What are the real constraints on using dummy variables? Of course it depends on the data, but is there a rule for when it is not recommended to use them?

Thank you very much in advance!!

Best Answer

First, the modeling function converts character vectors to factors behind the scenes. More specifically, characters and factors are both turned into dummy-coded variables. If you are using R, you can run:

dat <- data.frame(
  y = rnorm(99),
  x = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

If you look at str(dat$x), you'll see that it is a character vector. But lm, the function that fits linear models, turns this character vector into a design matrix under the hood. It does this with the model.matrix() function. You can call it yourself by passing the right-hand side of your lm formula (everything after the ~) and the data: model.matrix(~ x, dat). This returns the dummy-coded columns for x together with an intercept column. The same thing happens if you set stringsAsFactors = TRUE. If you run the following code, you'll see that it returns TRUE, meaning the two model matrices are identical:

set.seed(1839)
dat1 <- data.frame(
  y = rnorm(99),
  x = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
set.seed(1839)
dat2 <- data.frame(
  y = rnorm(99),
  x = c("a", "b", "c"),
  stringsAsFactors = TRUE
)
identical(
  model.matrix(~ x, dat1),
  model.matrix(~ x, dat2)
)

This means that, for the purpose of fitting the model, characters and factors are treated the same way. Most other statistical software handles categorical predictors in a similar fashion.
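To tie this back to the coefficients in the question, here is a minimal sketch that fits the same model with a character predictor and with a factor predictor and compares the estimates. The variable names (x_chr, x_fct, x_ref_c) are made up for illustration. The last part shows one practical reason people still prefer factors: relevel() lets you choose the reference category explicitly.

```r
set.seed(1839)
y <- rnorm(99)
x_chr <- rep(c("a", "b", "c"), times = 33)  # character predictor
x_fct <- factor(x_chr)                      # same values stored as a factor

fit_chr <- lm(y ~ x_chr)
fit_fct <- lm(y ~ x_fct)

# Coefficient names differ (x_chrb vs. x_fctb), so compare the values only
same_estimates <- all.equal(unname(coef(fit_chr)), unname(coef(fit_fct)))
print(same_estimates)

# With a factor you can set the reference category yourself
x_ref_c <- relevel(x_fct, ref = "c")
fit_ref_c <- lm(y ~ x_ref_c)
print(names(coef(fit_ref_c)))  # "(Intercept)" "x_ref_ca" "x_ref_cb"
```

With "c" as the reference, the remaining coefficients are differences from "c" rather than from "a"; the fitted values do not change.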

Second, when you talk about "regression or classification," this refers to the dependent variable; what you are talking about, it seems, is the independent variable. Broadly speaking, regression is used when the dependent variable is numeric, while classification is used when it is categorical.
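As a rough illustration of that distinction, the sketch below uses the same categorical predictor x in both settings and only changes the dependent variable. The binary outcome y_cat is invented here by thresholding y_num, purely so there is something categorical to predict; logistic regression via glm() is just one possible classifier.

```r
set.seed(1839)
dat <- data.frame(
  y_num = rnorm(99),
  x = rep(c("a", "b", "c"), times = 33),
  stringsAsFactors = FALSE
)
# Hypothetical categorical outcome, for illustration only
dat$y_cat <- factor(ifelse(dat$y_num > 0, "high", "low"))

# Regression: numeric dependent variable
fit_reg <- lm(y_num ~ x, data = dat)

# Classification: categorical dependent variable (logistic regression)
fit_cls <- glm(y_cat ~ x, data = dat, family = binomial)

# The predictor is dummy-coded the same way in both models
print(names(coef(fit_reg)))
print(names(coef(fit_cls)))
```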

Third, you may need to find a better way to model a factor with 32 categories. Is the sample size in each category large enough to actually fit the model? Do you care about the interpretability of the coefficients? If the answers to those questions are yes and no, respectively, then you can use a factor with 32 categories. If not, you could look for ways to collapse the 32 categories into fewer. Or you could fit a multilevel/hierarchical model and treat the 32 categories as groups. Which option is best depends on the situation.
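One simple way to collapse categories is to lump the rare ones into a single "other" level. The following is only a sketch on simulated data: the 40 category names, the sampling probabilities, and the min_n threshold are all assumptions chosen for illustration, and whether lumping by frequency makes sense depends on what the categories mean.

```r
set.seed(1839)
# Hypothetical data: 40 categories with very uneven frequencies
lvls <- paste0("cat", 1:40)
x <- sample(lvls, size = 2000, replace = TRUE, prob = (40:1)^3)

counts <- table(x)
min_n <- 30                           # hypothetical minimum count per level
rare <- names(counts)[counts < min_n]

# Replace every rare category with a single "other" level
x_collapsed <- factor(ifelse(x %in% rare, "other", x))

print(nlevels(factor(x)))     # number of levels before collapsing
print(nlevels(x_collapsed))   # number of levels after collapsing
```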

Fourth, you do not need a special algorithm to transform a character/factor variable into dummy-coded variables. In applied work, the question is usually: "Is the function smart enough to do it for me, or do I need to make them myself first?" That depends on the package you use. If you need to make them beforehand, there are many packages that will help (I tend to just use the model.matrix function).
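For the case where you do have to build the dummies yourself, a minimal sketch with model.matrix() might look like this. The object names X and dummies are arbitrary; dropping the intercept column is an assumption about what the downstream function expects.

```r
set.seed(1839)
dat <- data.frame(
  y = rnorm(99),
  x = rep(c("a", "b", "c"), times = 33),
  stringsAsFactors = FALSE
)

# Full design matrix: intercept plus dummies for "b" and "c" ("a" is the reference)
X <- model.matrix(~ x, data = dat)
print(colnames(X))  # "(Intercept)" "xb" "xc"

# Keep only the dummy columns, as a numeric matrix
dummies <- X[, -1, drop = FALSE]
print(head(dummies, 3))
```

The resulting matrix can then be passed to a function that only accepts numeric input.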