Solved – Strange encoding for categorical features

anomaly-detection, categorical-encoding, feature-selection, feature-engineering, isolation-forest

I am reading through https://arxiv.org/pdf/1609.06676.pdf which presents an extension of the isolation forest algorithm so that categorical features may be taken into account. On page 5, the authors note:

… we extend the algorithm to consider categorical data. Our method only requires that for each categorical dimension, values have an ordering. The ordering may be arbitrary. Each value is then mapped to a numeric value, based on its ordering. For example the values true and false may be mapped to false = 0, true = 1. Having mapped the categorical values to numeric values, the categorical dimensions can be treated the same way as the numeric dimensions in the iForest algorithm.
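A minimal sketch of what the quoted passage describes, assuming an arbitrary but fixed per-column ordering (the data and the "first-seen order" choice here are my own illustration, not the paper's):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

colors = np.array(["red", "blue", "blue", "green", "red", "blue"])

# Arbitrary ordering: here, first-seen order defines the integer code.
order = {v: i for i, v in enumerate(dict.fromkeys(colors))}
encoded = np.array([order[v] for v in colors], dtype=float).reshape(-1, 1)

# The encoded column can now be split on like any numeric feature.
clf = IsolationForest(random_state=0).fit(encoded)
scores = clf.score_samples(encoded)
```

Whether those splits are meaningful is exactly the question below, since the ordering carries no information.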

Does this approach make sense?

At first I thought: doesn't this produce exactly the same result as applying Scikit-Learn's LabelEncoder()? However, the authors seem to do it without first building a sorted set of unique values. An alternative would be one-hot encoding, though that blows up the feature space very quickly for high-cardinality categorical features.
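For comparison, LabelEncoder does build a unique set and sorts it, so its codes follow alphabetical order rather than an arbitrary one (toy values for illustration):

```python
from sklearn.preprocessing import LabelEncoder

values = ["true", "false", "true", "maybe"]

le = LabelEncoder()
codes = le.fit_transform(values)

# Codes follow sorted unique values: false=0, maybe=1, true=2.
print(codes)             # [2 0 2 1]
print(list(le.classes_))  # ['false', 'maybe', 'true']
```

The paper's scheme only requires *some* fixed ordering, so the two approaches can assign different codes to the same data.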

Best Answer

Yes, this sounds like label encoding (a machine-learning term I never encountered in statistics), and it doesn't make much sense for unordered categorical variables. If the algorithm cannot cope with dummies, maybe try some variant of target/mean encoding (mentioned here).
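A minimal sketch of target/mean encoding with pandas, assuming a target column is available (the data and column names are invented; in a purely unsupervised setting you would need some proxy target):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "y":    [1,   0,   1,   1,   0,   1],
})

# Replace each category by the mean of the target within that category.
means = df.groupby("city")["y"].mean()
df["city_encoded"] = df["city"].map(means)
```

In practice you would add smoothing or cross-fitting to avoid leakage, but the idea is just this replacement.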

You could first fit a linear model (maybe glmnet) with regularization appropriate for a categorical variable with many levels (see Principled way of collapsing categorical variables with many levels?), and then encode the categorical variable with that model's estimated coefficients for its levels. That at least should be worth a try.
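A hedged sketch of that coefficient-encoding idea in Python (ridge regression standing in for glmnet; the data and column names are invented):

```python
import pandas as pd
from sklearn.linear_model import Ridge

df = pd.DataFrame({
    "level": ["a", "b", "c", "a", "b", "c", "a", "b"],
    "y":     [1.0, 2.0, 3.0, 1.2, 1.8, 3.1, 0.9, 2.2],
})

X = pd.get_dummies(df["level"])  # one dummy column per level
# Regularization shrinks the coefficients of rare/noisy levels.
model = Ridge(alpha=1.0).fit(X, df["y"])

# Replace each level by its fitted coefficient.
coef_by_level = dict(zip(X.columns, model.coef_))
df["level_encoded"] = df["level"].map(coef_by_level)
```

The resulting single numeric column then reflects each level's estimated association with the response, rather than an arbitrary ordering.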
