Solved – Random forests in MATLAB and outlier detection

MATLAB, outliers, random forest

I am solving a regression problem with random forests in MATLAB, using its TreeBagger class. I already get reasonable results, but there are a few questions I can't answer with a simple Google search. All questions below concern the regression task.

  1. The predict method of the TreeBagger class returns the predicted value and also the standard deviation of the individual trees' predictions. Intuitively, the higher this deviation, the less reliable the result, but I am not sure whether it is correct to base decisions on these values. So the question is: how can the standard deviation over the ensemble of trees be used in practice? Can it be used for something like detecting outlying predictions?

  2. I've seen several papers that mention that RF can be used for outlier detection. Is it possible to do that with MATLAB's TreeBagger, and how? If I am solving a regression problem with RF, can I add an outlier detection step based on RF to the procedure?

  3. Let's assume the answers to my previous questions are positive, and I have built an RF and set up an outlier detection procedure. Is it then correct to run the outlier test on new data before trying to predict its value with the RF model?

  4. For my problem I get the same result for forests with 10 and with 100 trees. Is there any good reason to choose the model with more trees? I always assumed that the simpler model should be chosen to avoid possible overfitting, but in light of my first question, the more trees there are, the more precise and informative the standard deviation over the ensemble of trees becomes.
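To make the quantity in questions 1 and 4 concrete, here is a minimal, language-neutral sketch in plain Python. It assumes the per-tree predictions are available as a table with one row per observation and one column per tree; the numbers and the name `ensemble_summary` are made up for illustration, and using the sample standard deviation is an assumption, not a statement about what TreeBagger computes internally.

```python
from statistics import mean, stdev

# Hypothetical per-tree predictions: one row per observation, one column per tree.
tree_preds = [
    [2.0, 2.1, 1.9, 2.0],  # trees agree, so the spread is small
    [1.0, 3.0, 5.0, 3.0],  # trees disagree, so the spread is large
]

def ensemble_summary(rows):
    """Return (mean, sample standard deviation) of the tree outputs per observation."""
    return [(mean(row), stdev(row)) for row in rows]

summary = ensemble_summary(tree_preds)
for avg, spread in summary:
    print(f"prediction={avg:.3f}  spread={spread:.3f}")
```

Because the spread is itself estimated from the trees, an estimate built from 10 trees is noisier than one built from 100, which is one argument for the larger forest even when the mean prediction barely changes.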

UPDATE:

Regarding my first question: I removed the 25% of cases in my validation set with the highest standard deviation over the ensemble of trees. For the remaining 75% of the validation data, MSE improved by 20%. This means I skip 25% of my data and make no prediction for it at all, but for the rest I get a 20% improvement in prediction quality, which is acceptable for my task. I still believe that something cleverer can be done with those standard deviations.
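The experiment in the update can be sketched roughly as follows, in plain Python with invented validation data. The helper names `mse` and `reject_most_uncertain`, and the `spread` values standing in for the per-observation standard deviation over the trees, are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error of paired true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def reject_most_uncertain(spread, reject_fraction):
    """Indices kept after dropping the given fraction with the highest spread."""
    n_reject = int(len(spread) * reject_fraction)
    order = sorted(range(len(spread)), key=lambda i: spread[i])
    return sorted(order[: len(spread) - n_reject])

# Hypothetical validation data: the last case has a large spread and a large error.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 2.1, 2.9, 6.0]
spread = [0.1, 0.2, 0.1, 1.5]

kept = reject_most_uncertain(spread, 0.25)
mse_all = mse(y_true, y_pred)
mse_kept = mse([y_true[i] for i in kept], [y_pred[i] for i in kept])
print(kept, mse_all, mse_kept)
```

For new data arriving one case at a time, a fraction cannot be applied directly. One option is to turn the cut-off found on the validation set into a fixed spread threshold and decline to predict above it.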

Best Answer

Regarding your first and second questions: TreeBagger has a property called OutlierMeasure that may be what you are looking for.

Edit: You may also benefit from reading this documentation, which has working examples.
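For intuition about what a proximity-based outlier measure does, here is a rough sketch in plain Python, in the spirit of Breiman's random forest outlier measure. The raw score is the number of cases divided by the sum of squared proximities to all other cases, then centred on the median and scaled by the median absolute deviation. The proximity matrix is invented, `outlier_scores` is a hypothetical helper, and whether TreeBagger's OutlierMeasure uses exactly this formula is an assumption to check against the documentation.

```python
from statistics import median

def outlier_scores(prox):
    """Proximity-based outlier scores; higher means more isolated."""
    n = len(prox)
    raw = []
    for i in range(n):
        # Sum of squared proximities to every other case.
        total = sum(prox[i][k] ** 2 for k in range(n) if k != i)
        raw.append(n / max(total, 1e-12))  # guard against division by zero
    med = median(raw)
    mad = median(abs(r - med) for r in raw)
    if mad == 0:
        return [0.0] * n
    return [(r - med) / mad for r in raw]

# Hypothetical proximity matrix: cases 0-2 often share leaves, case 3 rarely does.
prox = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.9, 0.1],
    [0.7, 0.9, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]

scores = outlier_scores(prox)
most_isolated = max(range(len(scores)), key=lambda i: scores[i])
print(most_isolated, [round(s, 2) for s in scores])
```

A case that rarely lands in the same leaves as the others has low proximities, so its raw score and its normalised score are both large.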