Solved – What algorithms require one-hot encoding

categorical datacategorical-encodingdata preprocessingmachine learning

I'm never sure when to use one-hot encoding for non-ordered categorical variables and when not to. I use it whenever the algorithm uses a distance metric to compute similarity. Can anyone give a general rule of thumb as to what types of algorithms would require non-ordered categorical features to be one-hot-encoded and which ones wouldn't?

Best Answer

Most algorithms (linear regression, logistic regression, neural network, support vector machine, etc.) require some sort of the encoding on categorical variables. This is because most algorithms only take numerical values as inputs.

Algorithms that do not require an encoding are algorithms that can directly deal with joint discrete distributions such as Markov chain / Naive Bayes / Bayesian network, tree based, etc.

Additional comments: