Solved – What algorithms require one-hot encoding

categorical datacategorical-encodingdata preprocessingmachine learning

I'm never sure when to use one-hot encoding for non-ordered categorical variables and when not to. I use it whenever the algorithm uses a distance metric to compute similarity. Can anyone give a general rule of thumb as to what types of algorithms would require non-ordered categorical features to be one-hot-encoded and which ones wouldn't?

Best Answer

Most algorithms (linear regression, logistic regression, neural network, support vector machine, etc.) require some sort of the encoding on categorical variables. This is because most algorithms only take numerical values as inputs.

Algorithms that do not require an encoding are algorithms that can directly deal with joint discrete distributions such as Markov chain / Naive Bayes / Bayesian network, tree based, etc.

Additional comments:

One hot encoding is one of the encoding methods. Here is a good resource for categorical variable encoding (not limited to R). R LIBRARY CONTRAST CODING SYSTEMS FOR CATEGORICAL VARIABLES
Even without encoding, distance between data points with discrete variables can be defined, such as hamming distance or Levenshtein Distance

Best Answer

Related Solutions

Solved – Dropping one of the columns when using one-hot encoding

Solved – Do I use dumthe encoding or one hot encoding when trying to do regression

Related Question