Logistic Regression in R – How to Handle Categorical Variables with One-Hot Encoding

Tags: categorical data, categorical-encoding, logistic, r, regression

I created a logistic regression model in R and fit the model using the MuMIn package. I have several categorical variables that were coded as factors, for example season (summer, fall, winter, spring) and color (brown, tan, white). The regression seemed to work fine, with no warnings or errors, but I recently came across one-hot encoding and am wondering whether I need to re-code the factors. Is one-hot encoding necessary for all non-ordinal categorical variables? How would one-hot encoding change how the variables are analyzed in the model?

Best Answer

Presumably the package you use builds its design matrix with R's built-in functions. By default these apply dummy coding (treatment contrasts) to unordered factors, which is almost one-hot encoding, except that one class serves as the reference class. This means that for $n$ classes there will be $n-1$ binary indicator variables. For the reference class all of these are 0; for any other class a single indicator is 1 and the rest are 0. Since your variables are already factors, you do not need to re-code them by hand.
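To see the difference between dummy coding and full one-hot encoding, you can inspect the design matrix yourself. The following is a minimal sketch using base R's `model.matrix()`; the small `color` vector is made up for illustration and is not the asker's data.

```r
# Hypothetical data: a three-level factor like the "color" variable in the question.
color <- factor(c("brown", "tan", "white", "tan", "brown"))

# Dummy coding (R's default for unordered factors): an intercept plus
# n - 1 indicator columns. The first level, "brown", is the reference class.
X <- model.matrix(~ color)
print(X)

# Full one-hot encoding for comparison: removing the intercept makes
# model.matrix() emit one indicator column per level.
X_onehot <- model.matrix(~ color - 1)
print(X_onehot)
```

In the first matrix, rows for the reference level have 0 in every indicator column, so its effect is absorbed into the intercept. Keeping all $n$ indicators together with an intercept would make the columns linearly dependent, which is why one level is dropped.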

It is not advisable to use the integer codes that underlie a factor directly as a numeric predictor in a model. Imagine we have a factor variable for color: yellow is 1, green is 2, red is 3. These numbers imply that red is somehow "2 more than" yellow, which is nonsense. In this sense you need one-hot encoding or something like it to deal with an unordered classification.
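A small simulation can make the problem with integer codes concrete. This is only a sketch: the data are simulated, and the per-color success probabilities are assumptions chosen so that the effect is not monotone in the level order.

```r
set.seed(1)  # simulated data, purely illustrative

# Factor with the level order from the example: yellow = 1, green = 2, red = 3.
color <- factor(sample(c("yellow", "green", "red"), 300, replace = TRUE),
                levels = c("yellow", "green", "red"))

# Assumed success probability per color (not monotone in the integer codes).
p <- c(yellow = 0.2, green = 0.8, red = 0.3)[as.character(color)]
y <- rbinom(300, size = 1, prob = p)

# Factor predictor: dummy coding gives each non-reference color its own coefficient.
fit_factor <- glm(y ~ color, family = binomial)

# Integer codes: a single slope forces the log-odds to change by the same
# amount from yellow to green as from green to red.
fit_integer <- glm(y ~ as.integer(color), family = binomial)

print(coef(fit_factor))
print(coef(fit_integer))
print(c(factor = deviance(fit_factor), integer = deviance(fit_integer)))
```

The integer-coded model is a restricted special case of the factor model, so its deviance can never be lower; how much worse it fits depends on how far the true effects are from a straight line in the codes.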
