Solved – Random forest regression for outlier detection

anomaly detection, random forest, regression

I am using Random Forest (RF) on a set of financial data. Normally when using RF, you split the data into a training and a test dataset.

However, I just want to find outliers.

When using RF to detect outliers, what is the best way to do this? Just fit a normal model and then use that model to predict on the same dataset it was built on? Does it make sense to have a train and a test dataset at all?

Best Answer

There are some well-developed algorithms that use trees to detect outliers. The key observation behind them is that outliers correspond to short path lengths in fully grown trees. In other words, if you grow a tree until each leaf holds only one sample, you will notice that the path length (i.e. the number of splits) needed to isolate an outlier is relatively small. The path length is therefore a measure of normality.

In a random forest setting, the feature used to split the data is chosen at random from the set of all possible features. So if you use, say, 10 trees, your measure of normality for a given point is a function of its average path length across all trees. This is essentially the idea behind the isolation forest algorithm. Although its common implementation uses extremely randomized tree regressors as the base estimators in a bagging ensemble, the intuition is the same. I'm not sure whether there are other, conceptually different, tree-based algorithms for outlier detection, but this approach is quite sound.
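To make this concrete, here is a minimal sketch using scikit-learn's IsolationForest on made-up data standing in for your financial dataset. The data, the `contamination` value, and the number of trees are all illustrative assumptions; the point is that you can fit on the full dataset and score the same rows, with low scores (short average path lengths) marking likely outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical data: mostly "normal" rows plus a few extreme ones (stand-in for financial data).
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
extreme = rng.normal(loc=0.0, scale=8.0, size=(10, 4))
X = np.vstack([normal, extreme])

# Fit on the full dataset; no train/test split is needed when the goal is only
# to flag outliers within this same data.
iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
iso.fit(X)

# score_samples is derived from the average path length of each point across all
# trees: shorter paths give lower scores, i.e. more anomalous points.
scores = iso.score_samples(X)
labels = iso.predict(X)  # +1 for inliers, -1 for flagged outliers

print("Rows flagged as outliers:", int(np.sum(labels == -1)))
print("Five lowest (most anomalous) scores:", np.sort(scores)[:5])
```

The `contamination` parameter only sets the threshold used by `predict`; if you prefer, you can ignore it and rank points directly by `score_samples`.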
