Solved – Problems with one-hot encoding vs. dummy encoding

categorical data, many-categories, multiple regression, regression

I am aware that a categorical variable with $k$ levels should be encoded with $k-1$ indicator variables in dummy encoding (and similarly for each of several categorical variables). I was wondering how much of a problem one-hot encoding (i.e., using $k$ variables instead) poses relative to dummy encoding for different regression methods, mainly linear regression, penalized linear regression (lasso, ridge, elastic net), and tree-based methods (random forests, gradient boosting machines).
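For concreteness, here is a minimal sketch of the two encodings using pandas (the `color` column and its levels are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: k columns for a k-level factor.
one_hot = pd.get_dummies(df["color"])

# Dummy encoding: k-1 columns; the dropped level becomes the baseline.
dummy = pd.get_dummies(df["color"], drop_first=True)

print(one_hot.shape[1], dummy.shape[1])  # 3 2
```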

I know that in linear regression, perfect multicollinearity arises (even though in practice I have fitted linear regressions using one-hot encoding without any apparent issues).

However, does dummy encoding need to be used with all of these methods, and how wrong would the results be if one used one-hot encoding instead?

My focus is on prediction in regression models with multiple (high-cardinality) categorical variables, so I am not interested in confidence intervals.

Best Answer

The issue with representing a categorical variable that has $k$ levels with $k$ variables in regression is that, if the model also has a constant term, then the columns of the design matrix will be linearly dependent and hence the model will be unidentifiable. For example, if the model is $\mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ and $X_2 = 1 - X_1$, then any choice $(\beta_0, \beta_1, \beta_2)$ of the parameter vector is indistinguishable from $(\beta_0 + \beta_2,\; \beta_1 - \beta_2,\; 0)$. So although software may be willing to give you estimates for these parameters, they aren't uniquely determined and hence probably won't be very useful.
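To see the dependence concretely, here is a small numpy sketch (the data are made up) showing that the one-hot design matrix with an intercept is rank-deficient, and that the two parameter vectors above give identical fitted values:

```python
import numpy as np

# Intercept plus both indicator columns of a 2-level factor: X2 = 1 - X1.
X1 = np.array([1, 0, 1, 0, 1])
X = np.column_stack([np.ones(5), X1, 1 - X1])

# The three columns are linearly dependent: rank 2, not 3.
print(np.linalg.matrix_rank(X))  # 2

# (b0, b1, b2) and (b0 + b2, b1 - b2, 0) produce the same predictions.
b = np.array([1.0, 2.0, 3.0])
b_alt = np.array([b[0] + b[2], b[1] - b[2], 0.0])
print(np.allclose(X @ b, X @ b_alt))  # True
```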

Penalization will make the model identifiable, but the redundant coding will still affect the parameter estimates in weird ways: as the example above suggests, it is then the penalty, not the data, that decides how a level's effect is split between the intercept and the redundant indicators.
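A quick way to see this, as a sketch assuming scikit-learn's `Ridge` on simulated data (all names and settings here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=200).astype(float)
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.1, size=200)

X_dummy = x1.reshape(-1, 1)               # k-1 coding
X_ohe = np.column_stack([x1, 1.0 - x1])   # redundant k coding

# The penalty makes both fits unique, but under the redundant coding
# it splits the level effect across two shrunken, opposing coefficients.
for X in (X_dummy, X_ohe):
    model = Ridge(alpha=1.0).fit(X, y)
    print(model.intercept_, model.coef_)
```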

The effect of a redundant coding on a decision tree (or an ensemble of trees) will likely be to overweight the feature in question relative to the others: since it's represented with an extra, redundant variable, it will be chosen for splits more often than it otherwise would be.
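One rough way to probe this claim empirically, assuming scikit-learn's `RandomForestRegressor` with per-split feature subsampling (`max_features=1`, which is where the extra column tilts the odds); the data and settings are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
cat = rng.integers(0, 2, size=1000).astype(float)  # 2-level factor
other = rng.normal(size=1000)                      # continuous competitor
y = cat + other + rng.normal(scale=0.5, size=1000)

X_dummy = np.column_stack([cat, other])            # k-1 coding
X_ohe = np.column_stack([cat, 1.0 - cat, other])   # redundant k coding

# With one candidate feature per split, the factor is drawn more often
# when it occupies two columns; compare its total importance share.
for X in (X_dummy, X_ohe):
    rf = RandomForestRegressor(
        n_estimators=300, max_features=1, random_state=0
    ).fit(X, y)
    print(rf.feature_importances_.round(3))
```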