Machine Learning – Is One Hot Encoding Preferable Only with Feature Multiplication by Coefficient?

categorical-data, categorical-encoding, feature-engineering, machine-learning, model-comparison

Suppose you have a linear model and a single feature named "color" (for the sake of simplicity). In a linear model you look for a coefficient $\theta_1$ that multiplies this feature $x$ in your hypothesis function $h\left(x\right) = \theta_1 x + \theta_2$. Likewise, if you had something like a neural network or logistic regression, you would look for a coefficient $\theta_1$ that multiplies this color feature in the hypothesis function $h\left(x\right) = \mathrm{sigm}\left(\theta_1 x + \theta_2\right)$.

So if your colors are encoded using numbers $1$ and $2$, then it doesn't make sense that the red color contributes $\theta \cdot 1$ while the blue color contributes $\theta \cdot 2$, whatever that $\theta$ is.
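The point above can be made concrete with a small sketch (the coefficient values below are made up for illustration): with integer encoding, blue's contribution is forced to be exactly twice red's, whereas one-hot encoding gives each color its own independent weight.

```python
# Sketch: integer vs. one-hot encoding of "color" in a linear
# hypothesis h(x) = theta_1 * x + theta_2. Coefficient values
# are arbitrary, chosen only to illustrate the structure.
import numpy as np

theta1, theta2 = 0.7, 0.1

# Integer encoding: red = 1, blue = 2.
h_red_int = theta1 * 1 + theta2   # theta * 1 + intercept
h_blue_int = theta1 * 2 + theta2  # theta * 2 + intercept
# Blue's contribution is forced to be exactly twice red's --
# an ordering/ratio the data never claimed.

# One-hot encoding: red = [1, 0], blue = [0, 1], with a separate
# weight per color, so the contributions are independent.
theta = np.array([0.7, -0.3])
h_red_ohe = theta @ np.array([1, 0]) + theta2
h_blue_ohe = theta @ np.array([0, 1]) + theta2
```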

My question: Is one-hot encoding preferable only in such models where you multiply the feature by some coefficient? For example, does it matter which encoding to use in a random forest? (I'm not sure, but as far as I know, when you calculate entropy you don't multiply features by coefficients in the way shown above.)

Best Answer

One-hot encoding ensures that no implicit order is imposed on the feature, while integer/label encoding introduces one. If there is no inherent ordering, the usual approach is one-hot encoding; however, in some situations (e.g. high-cardinality features) other options can be preferred.
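As a minimal sketch of the two encodings for an unordered "color" feature (using only NumPy, with an alphabetical category order assumed for illustration):

```python
# Label encoding assigns each category an integer (implicitly
# ordering them); one-hot encoding uses one binary column per
# category, imposing no order. Category order here is just
# alphabetical, an arbitrary choice for the sketch.
import numpy as np

colors = ["Red", "Blue", "Green", "Yellow", "Blue"]
categories = sorted(set(colors))  # ['Blue', 'Green', 'Red', 'Yellow']

# Label encoding: category -> integer index.
label = {c: i for i, c in enumerate(categories)}
label_encoded = np.array([label[c] for c in colors])

# One-hot encoding: pick the matching row of an identity matrix,
# giving one binary indicator column per category.
one_hot = np.eye(len(categories), dtype=int)[label_encoded]
```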

How you encode your features always matters because it changes the model's behavior. For example, in random forests, or simply decision trees, label encoding makes it possible to split the samples into two groups where one side is, say, Red and Blue and the other side is Green and Yellow: if the colors are ordered as Red, Blue, Green, Yellow (i.e. $1, 2, 3, 4$), a single split with respect to the value $2.5$ achieves this. With one-hot encoding, this is not possible in one split. Whether such a split makes sense depends on the context of the problem. This naturally affects the branching of your tree(s), especially under hyper-parameters like maximum depth.

Therefore, we can't say that OHE only matters for models where you multiply your features with coefficients.
