Solved – Isolation forest with categorical data

anomaly detection, categorical data, categorical-encoding, isolation-forest, outliers

I understand how isolation forests work with numeric data, but how can they work with categorical data?

Also, at least when working with scikit-learn, the recommendation I saw was to convert categorical data with one-hot encoding or something similar. How can I interpret what is happening in that case?

Best Answer

As far as I can see, categorical data is only directly usable by an isolation forest if it is ordinal. In that case you can use an OrdinalEncoder to encode the categories as numbers while retaining their ordering. The algorithm then works the same way as for numerical data: each split picks a random threshold between the feature's minimum and maximum encoded values, and the ordering keeps that threshold meaningful.
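To make the ordinal case concrete, here is a minimal sketch assuming scikit-learn. The `sizes` feature and its category order are made up for illustration; the point is only that the order is passed to the encoder explicitly, so the encoded numbers reflect it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature (illustrative data, not from the question)
sizes = np.array([["small"], ["medium"], ["large"], ["medium"], ["small"], ["large"]])

# State the order explicitly; otherwise categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
X_ordinal = encoder.fit_transform(sizes)  # small -> 0.0, medium -> 1.0, large -> 2.0

# The forest now sees an ordinary numeric column
forest = IsolationForest(n_estimators=100, random_state=0).fit(X_ordinal)
scores = forest.score_samples(X_ordinal)  # lower score = more anomalous
```

With real data you would normally encode only the ordinal columns this way and pass the numeric columns through unchanged.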

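If the data is not ordinal, however, one-hot encoding (OneHotEncoder) is the more reasonable choice, because with an OrdinalEncoder the isolation forest would assume an ordering that does not exist. With one-hot encoding, each category becomes its own binary column, so the algorithm can still gain information from the individual values of a feature: a split on one of those columns separates the points that have that category from those that do not.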
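For the non-ordinal case, a small sketch along the same lines, again assuming scikit-learn. The `colors` data is invented, with one deliberately rare category, to show how a split on a one-hot column can isolate a rare value quickly.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature: "green" appears once, the others about 50 times
colors = np.array([["red"]] * 50 + [["blue"]] * 49 + [["green"]])

# One binary column per category; columns follow sorted order: blue, green, red
encoder = OneHotEncoder(handle_unknown="ignore")
X_onehot = encoder.fit_transform(colors).toarray()

forest = IsolationForest(n_estimators=200, random_state=0).fit(X_onehot)
scores = forest.score_samples(X_onehot)  # lower score = more anomalous

# A single split on the "green" column isolates the rare row,
# so its average path length is short and its score is the lowest
rare_score = scores[-1]
```

Here the rare category ends up with the lowest score because it can be separated from everything else in one or two splits, while the common categories stay grouped in large leaves. Note that this only flags rare category values; a category being rare is not always the same as being anomalous.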