Solved – Feature Importance in Isolation Forest

anomaly-detection, isolation-forest, outliers, random-forest, scikit-learn

In an unsupervised setting with higher-dimensional data (e.g. 10 variables, numerical and categorical; 5,000 samples; a ratio of anomalies likely 1% or below, but unknown), I am able to fit an isolation forest and retrieve the computed anomaly scores (following the original paper and using the implementation in scikit-learn). This gives me a ranking of potential anomalies to consider. However, how would I further assess the validity of these flags? How can I understand which feature has contributed most to the anomaly score? Feature importance techniques usually applied to random forests do not seem to work in the case of the isolation forest.
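For reference, a minimal sketch of this setup (with `df` as a placeholder for the data frame, and one-hot encoding of the categorical variables assumed):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# df is a placeholder for the 5000 x 10 data frame; one-hot encode
# the categorical variables so the trees can split on them.
X = pd.get_dummies(df)

# The contamination ratio is unknown, so keep the "auto" offset
# from the original paper.
iso = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
iso.fit(X)

# Lower score_samples values correspond to more anomalous observations.
scores = iso.score_samples(X)
ranking = scores.argsort()  # indices ordered from most to least anomalous
```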

I'm interested to hear your thoughts; any help is much appreciated.

Best Answer

I believe it was not implemented in scikit-learn because, in contrast to the Random Forest algorithm, in an Isolation Forest the feature to split on at each node is selected at random. So it is not possible to define a notion of feature importance similar to the one used in RF.

Having said that, if you are confident in the results of the Isolation Forest and have the capacity to train another model, you could use the output of the Isolation Forest (i.e. the -1/1 predictions) as the target class to train a Random Forest classifier. This will give you feature importances for detecting anomalies.
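A minimal sketch of this surrogate approach (reusing the encoded feature matrix `X` from the question, which is an assumption) could look like the following:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest, RandomForestClassifier

# X is assumed to be the one-hot-encoded feature matrix from the question.
iso = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
labels = iso.fit_predict(X)  # -1 for flagged anomalies, 1 for inliers

# Train a surrogate Random Forest to reproduce the Isolation Forest's flags,
# then read off its impurity-based feature importances.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, labels)

importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Note that with anomalies at 1% or below the surrogate classes will be heavily imbalanced, so you may want to pass `class_weight="balanced"` to the Random Forest.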

Please note that I haven't tried this myself, so I can't comment on the accuracy of this proposed approach.