I am using Isolation Forest for anomaly detection (the scikit-learn implementation in Python).
My data have 1000 dimensions. The normal data, which I use to train the Isolation Forest model, have only two non-zero features. All samples are unique but differ only in those two components.
My anomaly samples, on the other hand, differ from the normal data in all the other components, but not in those two.
When I train an Isolation Forest model on my training data alone, it does not detect any anomalies.
If I add a single anomaly to the training set and train another model, that model detects almost everything correctly, with a low false-positive count.
It does not seem right that the model needs an anomaly sample.
Am I doing it correctly?
How to fix it?
Thank you
Best Answer
First, some notation: let $x=(x_1, x_2)$ be the features that vary within the normal data, and $z$ be all the other features (that equal zero in the normal data).
Your model did not use $z$, because $z$ contained no variation in the training data and was therefore useless for splitting. If you want to force $z$ into the model, you can add small random noise to it: $\tilde{z}:=z+\varepsilon$. The variance of $\varepsilon$ may be on the same scale as your measurement error, or simply represent your notion of how large $z$ can be while still counting as "normal". A quick illustration:
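Here is a minimal sketch of the noise trick with scikit-learn's `IsolationForest`. The dimensions, sample counts, and the noise scale `eps` are illustrative assumptions, not values from your data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n_train, n_test, d = 500, 50, 1000

# Normal data: only the first two features vary, the rest are exactly zero.
X_train = np.zeros((n_train, d))
X_train[:, :2] = rng.normal(size=(n_train, 2))

# Anomalies: the first two features are zero, the other 998 are not.
X_anom = rng.normal(size=(n_test, d))
X_anom[:, :2] = 0.0

# Trained on the raw data, the forest never splits on the constant
# features, so these anomalies look perfectly normal.
clf_plain = IsolationForest(random_state=0).fit(X_train)

# Add small noise to the training data first; eps is an assumed scale
# standing in for measurement error.
eps = 1e-3
clf_noisy = IsolationForest(random_state=0).fit(
    X_train + rng.normal(scale=eps, size=X_train.shape)
)

plain_rate = (clf_plain.predict(X_anom) == -1).mean()
noisy_rate = (clf_noisy.predict(X_anom) == -1).mean()
print("fraction of anomalies flagged, plain:", plain_rate)
print("fraction of anomalies flagged, noisy:", noisy_rate)
```

With the noise in place, every feature has some spread in the training set, so the trees can split on the $z$ components and the anomalies, which are extreme there, get short isolation paths.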
However, if you know that $z=0$ for normal data and $z\neq0$ for anomalies, you do not need an isolation forest at all: you already have a very simple and sound decision rule!
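That rule can be written in a few lines. The indices of the two varying features and the tolerance below are hypothetical placeholders; substitute your own:

```python
import numpy as np

def is_anomaly(sample, varying_idx=(0, 1), tol=1e-9):
    """Flag a sample if any feature outside the varying ones is non-zero.

    `varying_idx` (indices of the two features that vary in normal data)
    and `tol` (numerical tolerance for "zero") are assumed values.
    """
    z = np.delete(np.asarray(sample, dtype=float), varying_idx)
    return bool(np.any(np.abs(z) > tol))

print(is_anomaly([0.3, -1.2, 0.0, 0.0]))  # False: only the varying features differ
print(is_anomaly([0.0, 0.0, 0.7, 0.0]))   # True: a "constant" feature is non-zero
```

A tolerance is worth keeping even for a rule this simple, since measurement noise may make the nominally zero components slightly non-zero.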