Solved – Does Isolation Forest need an anomaly sample during training

Tags: anomaly detection, dataset, isolation-forest, machine learning, scikit learn

I am using Isolation Forest for anomaly detection (the scikit-learn implementation in Python).
My data have 1000 dimensions. In my normal data, which I use to train the Isolation Forest model, only two features are non-zero. All samples are unique but differ only in those two components.

My anomaly samples, on the other hand, differ from the normal data in all the other components, but not in those two.

When I train the Isolation Forest model on my normal training data only, the model does not detect any anomalies.

If I add one anomaly to the training set and train another model, that model detects almost everything correctly, with a low false positive count.

I don't think it is OK that the model needs an anomaly sample.
Am I doing this correctly?
How can I fix it?
Thank you

Best Answer

First, some notation: let $x=(x_1, x_2)$ be the features that vary within the normal data, and $z$ be all the other features (that equal zero in the normal data).

Your model did not use $z$ because $z$ has no variation in the training data, so no tree can split on it. If you want to force $z$ into the model, you can add small random noise to it: $\tilde{z}:=z+\varepsilon$. The variance of $\varepsilon$ can be on the same scale as your measurement error, or simply reflect your notion of how large $z$ can be while still being "normal". A quick illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
X, _ = load_iris(return_X_y=True)
target = np.array([0]*140+[1]*10) # 140 normal observations, 10 anomalies
X = np.c_[X, target] # a column that marks anomalies is included into X!
X_train = X[:140, :] # include only normal observations into train set
np.random.seed(1)

# a forest with raw data
forest1 = IsolationForest(contamination=0.0001).fit(X_train)
print(roc_auc_score(target, -forest1.decision_function(X)))
# this ROC AUC is only 0.6475, not a good model

# a forest with jittered data
forest2 = IsolationForest(contamination=0.0001).fit(X_train + np.random.normal(size=X_train.shape, scale=0.0001))
print(roc_auc_score(target, -forest2.decision_function(X)))
# this ROC AUC is 0.9521, much better
```
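The claim that a forest trained on raw data ignores $z$ can also be checked directly on data shaped like yours. The following is only a minimal sketch, assuming a recent scikit-learn: the sizes (200 samples, 20 features instead of 1000), the seeds, and the value 5.0 used for anomalous $z$ are made up for illustration. It trains on data where only $x$ varies and compares the scores of the training points with copies of those points whose $z$ is far from zero.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; the question has 1000 features.
n_samples, n_features = 200, 20

# Normal data: only the first two columns (x_1, x_2) vary, z is all zeros.
X_train = np.zeros((n_samples, n_features))
X_train[:, :2] = rng.normal(size=(n_samples, 2))

forest = IsolationForest(random_state=0).fit(X_train)

# Anomalies: identical x, but every z component is far from zero.
X_anom = X_train.copy()
X_anom[:, 2:] = 5.0

scores_normal = forest.decision_function(X_train)
scores_anom = forest.decision_function(X_anom)

# The trees can only split on columns that vary in the training data,
# so each anomaly follows the same path as its normal counterpart.
same_scores = np.allclose(scores_normal, scores_anom)
print(same_scores)
```

If the scores coincide, the forest has no way to separate these anomalies from the normal points, which matches what you observed before adding an anomaly to the training set.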

However, if you know that $z=0$ for normal data and $z\neq0$ for anomalies, you do not need any isolation forest - you already have a very simple and sound decision rule!
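That decision rule could look roughly like the sketch below. The function name `is_anomaly`, the `normal_cols` argument (the positions of $x_1, x_2$), and the optional tolerance `tol` are all hypothetical names introduced here for illustration; in practice you might set `tol` to the scale of your measurement error.

```python
import numpy as np

def is_anomaly(X, normal_cols=(0, 1), tol=0.0):
    """Flag rows where any feature outside `normal_cols` exceeds `tol` in absolute value."""
    X = np.asarray(X, dtype=float)
    # Boolean mask selecting the z columns (everything except x_1, x_2).
    z_mask = np.ones(X.shape[1], dtype=bool)
    z_mask[list(normal_cols)] = False
    return np.any(np.abs(X[:, z_mask]) > tol, axis=1)

X = np.array([
    [0.3, 1.2, 0.0, 0.0],   # normal: z is all zeros
    [0.3, 1.2, 0.0, 0.7],   # anomaly: one z component is non-zero
    [2.0, -1.0, 0.0, 0.0],  # normal: a different x, z still all zeros
])
flags = is_anomaly(X)
print(flags)
```

Here only the second row is flagged, and no training step is needed at all.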
