Solved – Validating Isolation Forests

ensemble-learning, isolation-forest, python, scikit-learn

I have a dataset on which I'd like to perform anomaly detection with an Isolation Forest. I have no way to validate the model (my data is unlabeled, which is why I'm using unsupervised learning in the first place), so how can I tell whether the model is working well? I could do a train-test split, but again, how would I know whether the predictions are correct on unlabeled data? (Plus, I'd like to keep as many words in my tf-idf vectorizer as possible, but that's another question.) There isn't any information about the amount of contamination either. How should I fine-tune the parameters and validate the results?

Best Answer

For Isolation Forests, here is one way to approach validation.

From the paper (Liu et al., 2008) and the scikit-learn implementation, we know there are two key parameters: n_estimators and max_samples.

  • When n_estimators = 100, the average path length (which determines the outlier score) has converged.
  • When max_samples = 256 (the default), the different benchmark datasets converge to similar AUC values, meaning there is no need to spend time on larger subsamples.
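To make that concrete, here is a minimal sketch of fitting scikit-learn's IsolationForest with those two parameters. The random matrix is purely a stand-in for your real feature matrix:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 10))  # stand-in for your real feature matrix

model = IsolationForest(
    n_estimators=100,  # the paper's convergence point for average path length
    max_samples=256,   # the paper's (and sklearn's) default subsample size
    random_state=42,
)
model.fit(X)

# score_samples returns the negated anomaly score: lower = more anomalous
scores = model.score_samples(X)
```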

In the paper, the authors test different values of n_estimators and max_samples to make sure the results have converged.

So, in your case, you could use a grid search to check which combination of n_estimators and max_samples reaches convergence fastest.
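Note that GridSearchCV doesn't apply directly here, since there is no labeled target to score against. One way to run the check (a sketch under my own assumptions, not a procedure from the paper) is a manual sweep that watches how much the anomaly scores drift between consecutive settings:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def score_with(X, n_estimators, max_samples, seed=0):
    # Fit a forest with the given settings and return per-sample anomaly scores
    forest = IsolationForest(
        n_estimators=n_estimators,
        max_samples=min(max_samples, len(X)),
        random_state=seed,
    )
    return forest.fit(X).score_samples(X)

n_estimators_grid = [50, 100, 200, 400]
max_samples_grid = [128, 256, 512]

X = np.random.RandomState(0).normal(size=(1000, 10))  # replace with your data

for m in max_samples_grid:
    previous = None
    for n in n_estimators_grid:
        scores = score_with(X, n, m)
        if previous is not None:
            # Mean absolute change in scores as a crude convergence signal:
            # once it stops shrinking, larger forests buy little.
            drift = np.abs(scores - previous).mean()
            print(f"max_samples={m:4d} n_estimators={n:4d} drift={drift:.4f}")
        previous = scores
```

Once the drift stops shrinking as n_estimators grows, the cheapest stable combination is a reasonable pick.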

In my case, I used the Titanic dataset. After scaling the raw data, I had about 120 features. I ran a grid search and found that at n_estimators = 200 and max_samples = 256, the outlier predictions begin to converge.
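For reference, that workflow looks roughly like this; the random matrix is a hypothetical stand-in for the ~120 encoded Titanic features, since the exact preprocessing isn't shown here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

# Hypothetical stand-in for the ~120 encoded Titanic features described above
X_raw = np.random.RandomState(1).normal(size=(891, 120))

X = StandardScaler().fit_transform(X_raw)  # scale the raw data first
scores = (
    IsolationForest(n_estimators=200, max_samples=256, random_state=1)
    .fit(X)
    .score_samples(X)
)
```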

This is how I validate unsupervised outlier detection.

Hope this helps.