Machine Learning – Encoding of Categorical Variables with High Cardinality

anomaly detectioncategorical datacategorical-encodingmachine learningmany-categories

For unsupervised anomaly detection / fraud analytics on credit card data (where I don't have labeled fraudulent cases), there are a lot of variables to consider. The data is of mixed type with continuous/numerical variables (e.g. USD amount spent) as well as categorical variables (e.g. account number).

What is the most suitable way of including categorical variables that have a very large number of unique classes? My thoughts so far:

  • Label Encoding (scikit-learn): i.e. mapping integers to classes. While it returns a nice single encoded feature column, it imposes a false sense of ordinal relationship (e.g. 135 > 72).
  • One Hot / Dummy Encoding (scikit-learn): i.e. expanding the categorical feature into lots of dummy columns taking values in {0,1}. This is infeasible for categorical features having e.g. >10,000 unique values. I understand that models will struggle with the sparse and large data.

What other (more advanced?) suitable methods are there to include large categorical feature columns? Is it possible to still use One Hot Encoding with some tricks? I read about bin counting (Microsoft blog) though I haven't found any applications related to intrusion detection / fraud analytics.

P.S.: In my view, this problem seems very similar to encoding an IP-address feature column when dealing with unsupervised intrusion detection.

Best Answer

This link provides a very good summary and should be helpful. As you allude to, label-encoding should not be used for nominal variables at it introduces an artificial ordinality. Hashing is a potential alternative that is particularity suitable for features that have high cardinality.

You can also use a distributed representation, which has become very popular in the deep learning community. The most common example given for distributed representation is word embeddings in NLP. That is not to say you cannot utilise them in encoding other categorical features. Here is an example.

Finally, account number would not be a wise input as it is more a unique identifier rather than a generalisable (account) feature.