Solved – How to get top features that contribute to anomalies in Isolation forest

anomaly detection, isolation-forest, machine learning

I am using Isolation Forest for anomaly detection on multidimensional data. The algorithm detects anomalous records with good accuracy. Besides detecting anomalous records, I also need to find out which features contribute most to a data point being flagged as anomalous. Is there any way to get this?

Best Answer

SHAP values and the shap Python library can be used for this. shap has had built-in support for scikit-learn's IsolationForest since October 2019.

import shap
from sklearn.ensemble import IsolationForest

# Load data and train Anomaly Detector as usual 
X_train, X_test, ...
est = IsolationForest()
est.fit(X_train)

# Create shap values and plot them
X_explain = X_test
shap_values = shap.TreeExplainer(est).shap_values(X_explain)
shap.summary_plot(shap_values, X_explain)

Here is an example of a plot I made for one of my IsolationForest models, which was trained on time-series data.
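If you would rather have the ranking as numbers than as a plot, one option is to average the absolute SHAP values per feature, which is, as far as I know, essentially the ordering the summary plot uses. The following is only a minimal sketch in plain Python. It assumes shap_values is a 2-D array-like with one row per explained sample and one column per feature; the function name global_feature_ranking, the feature names and the toy numbers are all made up for illustration and are not part of shap.

```python
def global_feature_ranking(shap_values, feature_names):
    """Rank features by mean absolute SHAP value over all explained rows.

    Hypothetical helper, not part of shap. `shap_values` is assumed to be
    a 2-D array-like (rows = samples, columns = features).
    """
    n_rows = len(shap_values)
    if n_rows == 0:
        raise ValueError("shap_values must contain at least one row")

    totals = [0.0] * len(feature_names)
    for row in shap_values:
        if len(row) != len(feature_names):
            raise ValueError("each row needs one value per feature")
        for j, value in enumerate(row):
            # Use the magnitude so positive and negative pushes do not cancel.
            totals[j] += abs(float(value))

    means = [total / n_rows for total in totals]
    # Largest mean magnitude first.
    return sorted(zip(feature_names, means), key=lambda p: p[1], reverse=True)


# Toy numbers standing in for shap.TreeExplainer(est).shap_values(X_explain).
toy_shap_values = [
    [-0.5, 0.25, 0.0],
    [-1.5, -0.25, 1.0],
]
toy_feature_names = ["f0", "f1", "f2"]

ranking = global_feature_ranking(toy_shap_values, toy_feature_names)
for name, mean_abs in ranking:
    print(name, mean_abs)
```

With real data you would pass the array returned by shap_values together with your column names (for example list(X_explain.columns) if X_explain is a pandas DataFrame).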

You can also get partial dependence plots for a particular feature, or a plot showing the feature contributions for a single X instance. Examples of both are given in the shap project README.
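To answer the original question for one specific record without a plot, you can sort that record's SHAP values by magnitude and keep the top few. Again, this is a rough sketch rather than anything built into shap: top_contributing_features is a hypothetical helper, and the feature names and numbers are invented. It ranks by absolute value and returns the signed value alongside, so you can check for your own model and shap version which sign corresponds to "more anomalous" instead of taking my word for it.

```python
def top_contributing_features(shap_row, feature_names, k=3):
    """Return the k features with the largest absolute SHAP value for one row.

    Hypothetical helper, not part of shap. `shap_row` is assumed to be the
    1-D row of SHAP values belonging to a single explained sample.
    """
    if len(shap_row) != len(feature_names):
        raise ValueError("shap_row needs one value per feature")

    pairs = [(name, float(value)) for name, value in zip(feature_names, shap_row)]
    # Sort by magnitude, keeping the sign so the direction stays visible.
    pairs.sort(key=lambda p: abs(p[1]), reverse=True)
    return pairs[:k]


# Toy row standing in for shap_values[i] of one anomalous record.
toy_row = [0.125, -2.0, 0.5, -0.75]
toy_names = ["temperature", "pressure", "vibration", "humidity"]

top = top_contributing_features(toy_row, toy_names, k=2)
for name, value in top:
    print(name, value)
```

In practice you would pick the index i of a record that est.predict flagged as an outlier (label -1) and pass shap_values[i] as the row.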