Reasons not to one-hot-encode categorical features

categorical-data, categorical-encoding

I overheard a colleague discussing her strategy for using categorical features the other day, and she mentioned that instead of one-hot-encoding, she does something like this:

cat_string    cat_num
dog                 0
cat                 1
dog                 0
dog                 0
horse               2

Then she feeds cat_num (along with other inputs, potentially) into her model(s), without one-hot-encoding.
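For concreteness, here is roughly how I imagine she builds that mapping (pandas' factorize is my guess at the mechanism; she might equally be using a hand-written lookup dict or sklearn's LabelEncoder):

    import pandas as pd

    df = pd.DataFrame({"cat_string": ["dog", "cat", "dog", "dog", "horse"]})

    # factorize assigns integers in order of first appearance:
    # dog -> 0, cat -> 1, horse -> 2, matching the table above
    codes, uniques = pd.factorize(df["cat_string"])
    df["cat_num"] = codes

    print(df)
    print(dict(enumerate(uniques)))   # {0: 'dog', 1: 'cat', 2: 'horse'}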

Now, there's obviously no issue if she wanted to assign these numbers to the categories and then one-hot-encode: the inputs created by OHEing cat_string versus cat_num would be identical, and by dropping cat_string and keeping only the mapping she would effectively just be compressing the data. So nothing bad about that.

I could see not one-hot-encoding cat_num being fine in two other cases:

1) Where there is some natural ordering to the categories, e.g. low, medium, and high, and that ordering is reflected in the integer encoding applied, e.g. low: 0, medium: 1, and high: 2

2) Where cat_num has relatively low cardinality, no inherent ordering, and you're using a tree-based model. The idea there is that a well-specified tree might identify reasonable split points and effectively "learn" the one-hot encoding. But that seems like poor practice when one-hot-encoding looks to be the best-practice way of doing things: OHE, in my opinion and in general, makes these scenarios much easier to understand, and is trivial if you have enough computing power. I also don't think this second approach would scale to high-cardinality categorical features with no inherent ordering, since learning "important" splits like cat_num <= 1000 seems like nonsense. And with a linear model, measuring the effect on the output of a "one-unit increase in cat_num", holding all else constant, is nonsense. (A small sketch of the ordinal mapping from case 1 and the one-hot alternative follows this list.)
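To make the two cases concrete, here is a small sketch (my own illustration, not her code) of the ordinal mapping from case 1 next to a plain one-hot encoding:

    import pandas as pd

    # Case 1: genuinely ordinal categories, where the integer codes carry meaning
    sizes = pd.DataFrame({"size": ["low", "high", "medium", "low"]})
    order = {"low": 0, "medium": 1, "high": 2}   # the ordering is deliberate
    sizes["size_num"] = sizes["size"].map(order)

    # The unordered case: one indicator column per level instead of one integer
    animals = pd.DataFrame({"cat_string": ["dog", "cat", "dog", "dog", "horse"]})
    indicators = pd.get_dummies(animals["cat_string"], prefix="is", dtype=int)
    print(indicators)   # columns: is_cat, is_dog, is_horse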

Can anyone else offer insight into this question?

Best Answer

A note on terminology first, as far as I am aware (unfortunately, a lot of blogs overlook the subtle differences, and so misinformation spreads):

Integer encoding (often called label encoding) is what your colleague is doing: generating a map from each unique value in a string column to an integer.

One-hot encoding is making K new columns (where K is the number of unique values), of which exactly one per row is 1. Dummy coding is the closely related scheme with K-1 columns, where the dropped level is represented by a row of all zeros.
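A minimal pandas sketch of the distinction as I use the terms (the column names and data are just illustrative):

    import pandas as pd

    s = pd.Series(["dog", "cat", "dog", "dog", "horse"], name="cat_string")

    # Integer / label encoding: one column of arbitrary integer codes
    codes, levels = pd.factorize(s)

    # One-hot: K indicator columns, exactly one of them is 1 in every row
    one_hot = pd.get_dummies(s, dtype=int)

    # Dummy coding: K-1 columns; the dropped level ("cat", first alphabetically)
    # is represented by all zeros
    dummy = pd.get_dummies(s, drop_first=True, dtype=int)

    print(one_hot)
    print(dummy)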

In the "dog, cat, horse" example, when using a decision tree, consider the following example. Perhaps your target variable is "has it ever meowed?". Clearly what you want your decision tree to do is be able to ask the question "is it a cat? (yes/no)".

If you integer-encode, such that dog -> 0, cat -> 1, horse -> 2, the tree can't isolate all of the cats with a single question, because decision trees only split on thresholds of the form "is feature x less than (or greater than) some value t?". Cats sit at 1, between dogs at 0 and horses at 2, so no single threshold separates them from everything else.

If you're using logistic regression on the integer code, it likewise can't assign the highest probability of meowing to cats: a single coefficient makes the predicted probability monotone in cat_num, and cats sit in the middle of that arbitrary ordering.

If you one-hot encode, the tree can explicitly ask "is the column that signifies cat greater than 0.5?", thus splitting your data into cats and not-cats.

If you use logistic regression, your optimiser can learn that the coefficient related to this column should be positive.
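To make both points concrete, here is a toy sketch with scikit-learn (the data and target are invented for illustration):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text
    from sklearn.linear_model import LogisticRegression

    animals = ["dog", "cat", "horse", "dog", "cat", "horse", "dog", "cat"]
    meowed  = [0, 1, 0, 0, 1, 0, 0, 1]   # only the cats have ever meowed

    # Integer codes: the tree needs two threshold questions to carve out
    # the middle value (cat = 1)
    code = {"dog": 0, "cat": 1, "horse": 2}
    X_int = pd.DataFrame({"cat_num": [code[a] for a in animals]})
    tree_int = DecisionTreeClassifier(random_state=0).fit(X_int, meowed)
    print(export_text(tree_int, feature_names=["cat_num"]))

    # Indicator columns: a single "is_cat <= 0.5" split separates cats from non-cats
    X_ind = pd.get_dummies(pd.Series(animals, name="animal"), prefix="is", dtype=int)
    tree_ind = DecisionTreeClassifier(random_state=0).fit(X_ind, meowed)
    print(export_text(tree_ind, feature_names=list(X_ind.columns)))

    # Logistic regression on the indicators learns a positive weight on is_cat
    logreg = LogisticRegression().fit(X_ind, meowed)
    print(dict(zip(X_ind.columns, logreg.coef_[0].round(2))))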

Thus in my opinion, whenever you have categorical data which has no implicit ordinality, always one-hot encode (or dummy-code); never feed the raw integer codes to the model.

In the case where your data has high cardinality this could cause problems, especially if the number of examples of each level is tiny. But that is not a problem a cleverer encoding can solve: the information is simply too detailed for the size of your training data, and using it as-is would lead to over-fitting.

Nonetheless, one way to mitigate this is to do some manual grouping (or actual clustering), in which you make a synthetic column that takes fewer values, with many of the unique values of the original column mapping to the same value in the new column (e.g. dog, cat, horse -> mammal; pigeon, parrot, chicken -> bird). This makes it easier for the algorithm to learn, and if there's enough data it can still split further within each group.
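A minimal sketch of that grouping step, assuming a hand-written mapping (the group labels are just for illustration):

    import pandas as pd

    group_map = {
        "dog": "mammal", "cat": "mammal", "horse": "mammal",
        "pigeon": "bird", "parrot": "bird", "chicken": "bird",
    }

    df = pd.DataFrame({"species": ["dog", "parrot", "horse", "chicken", "cat"]})
    df["species_group"] = df["species"].map(group_map)

    # One-hot encode the coarser column: two indicator columns instead of
    # one per original species
    grouped = pd.get_dummies(df["species_group"], prefix="group", dtype=int)
    print(pd.concat([df, grouped], axis=1))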