I am using Isolation Forest for anomaly detection on multidimensional data. The algorithm detects anomalous records with good accuracy. Besides detecting anomalous records, I also need to find out which features contribute the most to a data point being anomalous. Is there any way to get this?
Solved – How to get top features that contribute to anomalies in Isolation Forest
Tags: anomaly-detection, isolation-forest, machine-learning
Related Solutions
First, some notation: let $x=(x_1, x_2)$ be the features that vary within the normal data, and $z$ be all the other features (that equal zero in the normal data).
Your model did not use $z$, because $z$ was useless for the training data: it contained no variation. If you want to force $z$ into the model, you can add small random noise to it: $\tilde{z}:=z+\varepsilon$. The variance of $\varepsilon$ may be on the same scale as your measurement error, or simply reflect your notion of how large a $z$ still counts as "normal". A quick illustration:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
X, _ = load_iris(return_X_y=True)
target = np.array([0]*140+[1]*10) # 140 normal observations, 10 anomalies
X = np.c_[X, target] # a column that marks anomalies is included into X!
X_train = X[:140, :] # include only normal observations into train set
np.random.seed(1)
# a forest with raw data
forest1 = IsolationForest(contamination=0.0001).fit(X_train)
print(roc_auc_score(target, -forest1.decision_function(X)))
# this ROC AUC is only 0.6475, not a good model
# a forest with jittered data
forest2 = IsolationForest(contamination=0.0001).fit(X_train + np.random.normal(size=X_train.shape, scale=0.0001))
print(roc_auc_score(target, -forest2.decision_function(X)))
# this ROC AUC is 0.9521, much better
However, if you know that $z=0$ for normal data and $z\neq0$ for anomalies, you do not need an isolation forest at all: you already have a very simple and sound decision rule!
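In code, that rule is a direct check. A minimal sketch, reusing the toy data above, where column 4 of X plays the role of $z$ (the z_cols name is hypothetical):
import numpy as np
z_cols = [4]  # hypothetical: indices of the z-type features within X
anomalous = np.any(X[:, z_cols] != 0, axis=1)  # flag rows where any z feature is nonzero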
I believe feature importance was not implemented for Isolation Forest in scikit-learn because, in contrast to the Random Forest algorithm, the feature to split on at each node is selected at random. So there is no direct notion of feature importance analogous to RF's.
Having said that, if you are very confident in the results of the Isolation Forest classifier and have the capacity to train another model, you could use the output of the Isolation Forest, i.e., its -1/1 predictions, as the target class to train a Random Forest classifier. This will give you feature importances for detecting anomalies.
Please note that I haven't tried this myself, so I can't comment on the accuracy of this proposed approach.
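A minimal sketch of that surrogate-model idea (X is your feature matrix; feature_names is a hypothetical list of column names):
from sklearn.ensemble import IsolationForest, RandomForestClassifier
# Fit the isolation forest and take its -1/1 predictions as surrogate labels
iso = IsolationForest(random_state=0).fit(X)
labels = iso.predict(X)  # -1 = anomaly, 1 = normal
# Train a supervised surrogate on those labels
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
# Impurity-based importances suggest which features drive the anomaly labels
for name, imp in sorted(zip(feature_names, rf.feature_importances_), key=lambda t: -t[1]):
    print(name, round(imp, 3))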
Best Answer
SHAP values and the shap Python library can be used for this. shap has had built-in support for scikit-learn's IsolationForest since October 2019.
Here is an example of a plot I made for an IsolationForest model of mine that was trained on time-series data.
You can also get partial dependence plots for a particular feature, or a plot showing the feature contributions for a single X instance. Examples of these are given in the shap project README.
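A rough sketch of that workflow (assuming X is a pandas DataFrame of features; the plot calls below are the standard shap API, not code from the original answer):
import shap
from sklearn.ensemble import IsolationForest
iso = IsolationForest(random_state=0).fit(X)
# TreeExplainer computes per-feature contributions to the anomaly score
explainer = shap.TreeExplainer(iso)
shap_values = explainer.shap_values(X)
# Global view: which features contribute most across all points
shap.summary_plot(shap_values, X)
# Local view: contributions for a single observation (here, row 0)
shap.force_plot(explainer.expected_value, shap_values[0, :], X.iloc[0, :], matplotlib=True)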