Categorical Features – Where to Find a Guide for Encoding Categorical Features?

Tags: categorical-data, categorical-encoding, feature-engineering, many-categories, references

I am facing an ML task with various categorical variables. Some examples include the following:

  • Binary variables (0,1).
  • Multilevel factors that can be ordered (low, medium, high).
  • Multilevel factors that cannot be ordered (red, blue, green).

I'm going to use a deep neural network for my classification task. However, I'd like to find a reasonably comprehensive guide to encoding categorical features that covers the most commonly used methods and current best practices. Is there a reference text?

I'm asking because I often face similar encoding tasks and usually apply some rough fix that is probably suboptimal but works more or less fine. That approach is not an option in some cases, however, and I would like to improve how I handle such tasks.

Best Answer

Binary variables

No encoding is needed: use them as is.

Nominal data

When you have a variable that can take on a finite number of values, that's called a categorical variable. When the values can't be ordered (e.g., red, blue, green), that's called a nominal variable. A nominal variable is one kind of categorical variable.

For nominal variables, the usual way to encode them is with a one-hot encoding. If there are $N$ possible values for the variable, you map each value to an $N$-vector that has a $1$ in the position corresponding to that value and $0$ elsewhere.

For instance: red $\mapsto (1,0,0)$, blue $\mapsto (0,1,0)$, green $\mapsto (0,0,1)$.
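
As a minimal sketch of the mapping above, in plain Python: the helper name `one_hot` and the fixed category order in `COLORS` are assumptions made for illustration, not part of any particular library.

```python
def one_hot(value, categories):
    """Return a list with 1 at the index of `value` in `categories`, 0 elsewhere."""
    if value not in categories:
        # Unknown categories are rejected here; other policies are possible.
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]

# Fixed category order (illustrative); it must stay the same at training and test time.
COLORS = ["red", "blue", "green"]

encoded = [one_hot(c, COLORS) for c in ["red", "blue", "green"]]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

In practice you would probably use the encoder that ships with your data or ML library; the point here is only that the category order has to be fixed once and reused.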

Ordinal data

When you have a categorical variable where the values can be ordered (sorted), but the ordering doesn't imply anything about how much they differ, that's called an ordinal variable (see ordinal data).

For example, suppose you have a ranking: John finished in 3rd place, Jane in 6th place. You know that John finished before Jane, but that doesn't necessarily mean that John was $6/3=2$ times as fast as Jane.

You can encode ordinal data using the thermometer trick. If there are $N$ possible values for the variable, then you map each value to an $N$-vector, where you put a $1$ in the position that matches the value of the variable and in all subsequent positions, and $0$ elsewhere.

For instance: first place $\mapsto (1,1,1)$, second place $\mapsto (0,1,1)$, third place $\mapsto (0,0,1)$.
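
The thermometer mapping can be sketched in a few lines of plain Python. The function name `thermometer` and the level list `PLACES` are hypothetical names chosen for this example; the levels must be listed in their natural order.

```python
def thermometer(value, ordered_levels):
    """Put 1 at the position of `value` and at every later position, 0 before it."""
    rank = ordered_levels.index(value)  # raises ValueError for an unknown level
    return [1 if position >= rank else 0 for position in range(len(ordered_levels))]

# Levels in their natural order (illustrative).
PLACES = ["first", "second", "third"]

codes = {place: thermometer(place, PLACES) for place in PLACES}
print(codes["first"])   # [1, 1, 1]
print(codes["second"])  # [0, 1, 1]
print(codes["third"])   # [0, 0, 1]
```

Unlike one-hot vectors, two thermometer codes differ in more positions the further apart the levels are in the ordering, so the order is visible to the network without assuming equal spacing on a single numeric axis.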

You can also apply binning if $N$ is too large, but usually it's better not to do that.

Numerical variables

You may also encounter variables that directly measure a quantity, whose values can be not only ordered but also meaningfully subtracted or divided. Then it's typically best to use the number directly, or possibly the logarithm of the number. (You might take the logarithm if the number represents a ratio, or if there is a very wide range of values.)
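
To illustrate the logarithm option, here is a small sketch using only the standard library. The helper `log_transform`, the choice of base 10, and the sample values are assumptions for the example; the answer does not prescribe a base, and zero or negative inputs need separate handling that is not covered here.

```python
import math

def log_transform(values):
    """Take log10 of each value; only defined for strictly positive inputs."""
    if any(v <= 0 for v in values):
        raise ValueError("log is only defined for positive values")
    return [math.log10(v) for v in values]

# Sample values spanning several orders of magnitude (illustrative).
raw = [1.0, 10.0, 100.0, 10000.0]
logged = log_transform(raw)
print(logged)  # the four-orders-of-magnitude range shrinks to roughly 0..4
```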

Useful background

To understand these terms, it's helpful to learn about "levels of measurement": https://en.wikipedia.org/wiki/Level_of_measurement.

Scaling

Finally, when you're using neural networks or "deep learning", you'll normally want to standardize/rescale all numerical attributes before training. I suggest you treat that as a separate process from the feature mappings mentioned above, to be performed after you apply the feature mapping.
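
One common way to do this is standardization to zero mean and unit variance. The sketch below handles a single numeric column with the standard library. The helper names `fit_standardizer` and `standardize` are hypothetical, and computing the statistics on the training data only and then reusing them elsewhere is a widespread convention that I am assuming here, not something stated above.

```python
import statistics

def fit_standardizer(column):
    """Compute mean and (population) standard deviation of a numeric column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    # Guard against constant columns to avoid division by zero.
    return mean, (std if std > 0 else 1.0)

def standardize(column, mean, std):
    """Rescale values using previously fitted statistics."""
    return [(x - mean) / std for x in column]

train = [1.0, 2.0, 3.0, 4.0]          # illustrative training values
mean, std = fit_standardizer(train)   # statistics come from the training data
scaled = standardize(train, mean, std)
```

The same `mean` and `std` would then be passed to `standardize` for validation and test data, so that all splits are rescaled consistently.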