Feature Engineering – Encoding High-Cardinality Categorical Features with Differing Cardinalities

categorical-data, classification, dimensionality-reduction, feature-engineering, many-categories

I have been looking through questions regarding categorical feature encoding, but couldn't find any which discuss my problem. Apologies if I missed it.


Let's say we have a dataset containing both binary and nominal variables, each of roughly equal importance.

Most classifiers cannot deal with categorical types directly, so these have to be transformed – for example using one-hot encoding (dummy variables) as explained in this answer.
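For concreteness, here is a minimal sketch of that transformation using pandas (the column names and toy data are just for illustration):

```python
import pandas as pd

# Toy data: one binary variable and one nominal variable with several categories
df = pd.DataFrame({
    "is_member": [0, 1, 1, 0],
    "city": ["London", "Paris", "Berlin", "Paris"],
})

# One-hot (dummy) encoding: every category of "city" becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```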

  • If one categorical variable has high cardinality, wouldn't encoding it this way "overpower" other (for example binary) variables? By "cardinality" I mean the number of categories in a nominal variable.

  • If our classifier model is aware of relationships between variables, wouldn't it unnecessarily attempt to find relationships between introduced binary dummy "components" of the same variable?

And if so, how could this be addressed?

The best solution I can think of is to logically group the values of high-cardinality variables into "buckets"; however, if there are enough unique values to be a problem, then grouping them manually would be labour-intensive as well.


Edit:
This is trivial and only partially addresses the problem, but one of the things I ended up doing is replacing all relatively rare categorical values with a new "other" category. It could be time-consuming to tune the threshold below which a value counts as "rare", but at least this approach can be automated.
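A sketch of how this could be automated with pandas (the function name and frequency threshold are my own, purely illustrative):

```python
import pandas as pd

def lump_rare_categories(series: pd.Series, min_freq: float = 0.01,
                         other_label: str = "other") -> pd.Series:
    """Replace categories rarer than `min_freq` (relative frequency) with one label."""
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < min_freq].index
    return series.where(~series.isin(rare), other_label)

# Categories covering less than 5% of the rows are collapsed into "other"
s = pd.Series(["a"] * 50 + ["b"] * 45 + ["c"] * 3 + ["d"] * 2)
print(lump_rare_categories(s, min_freq=0.05).value_counts())
```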

Best Answer

If one categorical variable has high cardinality, wouldn't encoding it this way "overpower" other (for example binary) variables?

It depends on the algorithm.

Algorithms that subsample columns (random forests, extremely randomized trees, gradient boosting, bagged classifiers...) train many models on subsamples of the data. If 90% of your columns come from a single "dummified" variable, it is likely that a large number of those models are actually working on the same underlying variable, making them more correlated than they should be and thus harming performance.
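As a rough illustration of why, here is a back-of-the-envelope calculation (the column counts are hypothetical):

```python
from math import comb

# Hypothetical setup: 100 columns in total, 90 of which are dummy columns of one
# high-cardinality variable; a tree samples 10 columns when looking for a split.
n_total, n_dummies, n_sampled = 100, 90, 10

# Probability that *every* sampled column belongs to that single dummified variable
p_all_dummies = comb(n_dummies, n_sampled) / comb(n_total, n_sampled)
print(f"P(all sampled columns come from the dummified variable) = {p_all_dummies:.2f}")
```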

Linear regression methods will not be affected; they will simply assign a weight to each binary column produced by the encoding.

With nearest-neighbour and similarity-based methods (such as kernel SVMs) the impact should be limited as well. No matter how many columns there are, the only thing that matters in the end is the inner product or the distance between two rows of your data. However many columns stem from a nominal variable, their contribution to the inner product can only be 0 or 1 (the nominal values were either equal or not), and the distance contribution likewise takes only two values.
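A small numeric check of that claim, using the one-hot block of a single nominal variable with five categories:

```python
import numpy as np

# Two distinct categories and one repeated category, one-hot encoded
row_a = np.array([1, 0, 0, 0, 0])   # category 1
row_b = np.array([0, 0, 1, 0, 0])   # category 3
row_c = np.array([1, 0, 0, 0, 0])   # category 1 again

# Inner product: 1 when the categories match, 0 otherwise,
# no matter how many dummy columns the variable produced
print(row_a @ row_b)                  # 0 (different categories)
print(row_a @ row_c)                  # 1 (same category)

# Euclidean distance likewise takes only two values
print(np.linalg.norm(row_a - row_b))  # sqrt(2) for different categories
print(np.linalg.norm(row_a - row_c))  # 0.0 for identical categories
```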

If our classifier model is aware of relationships between variables, wouldn't it unnecessarily attempt to find relationships between introduced binary "components" of the same variable?

How is your classifier "aware" of relationships between variables? I am not sure I can address this question.

And if so, how could this be addressed?

In the case of any method relying on column subsampling, prior weights could be given to the columns (so that they are not all selected with the same probability). However, I do not have any implementations in mind that do this. A quick fix could be to repeat the other columns, so that their likelihood of being selected artificially increases.
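A sketch of that quick fix (the helper name and repeat counts are purely illustrative; duplicating columns is a crude stand-in for proper column weights):

```python
import pandas as pd

def repeat_columns(df: pd.DataFrame, repeats: dict) -> pd.DataFrame:
    """Duplicate selected columns so column-subsampling methods pick them more often."""
    out = df.copy()
    for col, n_copies in repeats.items():
        for i in range(1, n_copies):
            out[f"{col}__copy{i}"] = df[col]
    return out

# Hypothetical usage: triple the presence of a binary column that would otherwise
# be drowned out by the many dummy columns of a high-cardinality variable
df = pd.DataFrame({"is_member": [0, 1, 1],
                   "city_London": [1, 0, 0],
                   "city_Paris": [0, 1, 1]})
print(repeat_columns(df, {"is_member": 3}).columns.tolist())
```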
