Solved – Validating Isolation Forests

ensemble-learning, isolation-forest, python, scikit-learn

I have a dataset on which I'd like to perform anomaly detection with an Isolation Forest. I have no way to validate the model (my data is unlabeled, which is why I'm using unsupervised learning in the first place), so how can I tell whether the model is working well? I could do a train-test split, but again, how would I know whether the predictions are correct on unlabeled data? (Plus, I'd like to keep as many words in my tf-idf vectorizer as possible, but that's another question.) There isn't any information about the amount of contamination either. How should I fine-tune the parameters and validate the results?

Best Answer

For Isolation Forests, here is one way to approach validation.

From the paper (Liu et al., 2008) and the scikit-learn implementation, we know there are two key parameters: n_estimators and max_samples.

  • When n_estimators = 100, the average path length (which determines the outlier score) has converged.
  • When max_samples = 256 (the default), the different benchmark datasets converge to similar AUC values, meaning there is no need to spend time on larger subsamples.
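To make that concrete, here is a minimal sketch of fitting scikit-learn's IsolationForest with those two parameters. The random matrix is purely a stand-in for your real feature matrix:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 10))  # stand-in for your real feature matrix

model = IsolationForest(
    n_estimators=100,  # the paper's convergence point for average path length
    max_samples=256,   # the paper's (and sklearn's) default subsample size
    random_state=42,
)
model.fit(X)

# score_samples returns the negated anomaly score: lower = more anomalous
scores = model.score_samples(X)
```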

In the paper, the authors test different values of n_estimators and max_samples to make sure the results have converged.

So, in your case, you could use a grid search to check which combination of n_estimators and max_samples reaches convergence fastest.
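Note that GridSearchCV doesn't apply directly here, since there is no labeled target to score against. One way to run the check (a sketch under my own assumptions, not a procedure from the paper) is a manual sweep that watches how much the anomaly scores drift between consecutive settings:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def score_with(X, n_estimators, max_samples, seed=0):
    # Fit a forest with the given settings and return per-sample anomaly scores
    forest = IsolationForest(
        n_estimators=n_estimators,
        max_samples=min(max_samples, len(X)),
        random_state=seed,
    )
    return forest.fit(X).score_samples(X)

n_estimators_grid = [50, 100, 200, 400]
max_samples_grid = [128, 256, 512]

X = np.random.RandomState(0).normal(size=(1000, 10))  # replace with your data

for m in max_samples_grid:
    previous = None
    for n in n_estimators_grid:
        scores = score_with(X, n, m)
        if previous is not None:
            # Mean absolute change in scores as a crude convergence signal:
            # once it stops shrinking, larger forests buy little.
            drift = np.abs(scores - previous).mean()
            print(f"max_samples={m:4d} n_estimators={n:4d} drift={drift:.4f}")
        previous = scores
```

Once the drift stops shrinking as n_estimators grows, the cheapest stable combination is a reasonable pick.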

In my case, I used the Titanic dataset. After scaling the raw data, I had about 120 features. I ran a grid search and found that at n_estimators = 200 and max_samples = 256, the outlier predictions begin to converge.
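For reference, that workflow looks roughly like this; the random matrix is a hypothetical stand-in for the ~120 encoded Titanic features, since the exact preprocessing isn't shown here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

# Hypothetical stand-in for the ~120 encoded Titanic features described above
X_raw = np.random.RandomState(1).normal(size=(891, 120))

X = StandardScaler().fit_transform(X_raw)  # scale the raw data first
scores = (
    IsolationForest(n_estimators=200, max_samples=256, random_state=1)
    .fit(X)
    .score_samples(X)
)
```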

This is how I validate unsupervised outlier detection.

Hope this helps.