Solved – RandomForests in Matlab and outliers detection

Tags: MATLAB, outliers, random forest

I am solving a regression problem with random forests in Matlab, using its built-in TreeBagger class. I already get reasonable results, but there are a few questions I can't answer with a simple Google search. All of the questions below concern the regression task.

1. The predict method of the TreeBagger class returns the predicted value and also the standard deviation of the individual trees' predictions. Intuitively, the higher this deviation, the less reliable the result, but I am not sure whether it is correct to base decisions on these values. How can the standard deviation over the ensemble of trees be used in practice? Can we do something like outlier detection on predictions based on it?
2. I've seen several papers mention that RF can be used for outlier detection. Is that possible with Matlab's TreeBagger, and how? If I am solving a regression problem with RF, can I add an outlier detection step that also uses RF?
3. Let's assume the answer to my previous question is yes, and I have built the RF and set up an outlier detection procedure. Is it then correct to run the outlier test on new data before predicting its value with the fitted RF model?
4. For my problem I get the same result for forests with 10 and with 100 trees. Is there any good reason to choose the model with more trees? I always assumed the simpler model should be preferred to avoid possible overfitting, but in light of my first question, the more trees there are, the more precise and informative the standard deviation over the ensemble becomes.

UPDATE:

Regarding my first question: I removed the 25% of cases in my validation set with the highest standard deviation over the ensemble of trees. For the remaining 75% of the validation data, MSE improved by 20%. This means I skip 25% of my data and make no prediction for it at all, but for the rest I get a 20% improvement in prediction quality, which is acceptable for my task. Still, I believe something more clever can be done with those standard deviations.
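The rejection procedure described in this update can be sketched as follows. This is a minimal illustration in Python rather than Matlab, and every name and number in it is hypothetical: `y_true`, `y_pred` and `y_std` stand in for the validation targets, the forest predictions, and the per-case standard deviations over the trees.

```python
def mse(y_true, y_pred):
    """Mean squared error of paired targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def reject_most_uncertain(y_true, y_pred, y_std, reject_fraction=0.25):
    """Drop the cases with the largest ensemble standard deviation.

    Returns the targets and predictions of the retained cases only.
    """
    n_reject = int(len(y_std) * reject_fraction)
    # Case indices ordered from most to least certain (ascending std).
    order = sorted(range(len(y_std)), key=lambda i: y_std[i])
    keep = sorted(order[:len(order) - n_reject])
    return [y_true[i] for i in keep], [y_pred[i] for i in keep]


# Hypothetical toy validation set (not data from the question).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_pred = [1.1, 1.9, 3.2, 4.0, 6.5, 6.1, 7.0, 5.5]
y_std = [0.2, 0.3, 0.25, 0.1, 1.5, 0.3, 0.2, 2.0]

kept_true, kept_pred = reject_most_uncertain(y_true, y_pred, y_std)
mse_all = mse(y_true, y_pred)
mse_kept = mse(kept_true, kept_pred)
print(f"MSE on all cases:  {mse_all:.3f}")
print(f"MSE on kept cases: {mse_kept:.3f}")
```

In this toy set the two cases with the largest spread are also the two worst predictions, so rejecting them lowers the MSE on the rest. Whether that holds for real data has to be checked on a validation set, as was done above.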

Best Answer

Regarding your first and second questions: TreeBagger has a property called OutlierMeasure that may be what you are looking for.

Edit: You may also benefit from reading this documentation, which has working examples.

Related Solutions

Solved – a mathematical way to define a point on a scatter plot as an outlier

Without knowing everything about your data or what your project is, it's hard to suggest what the "right" method is. A better way to think about it is probably that either method may work, but that you need transparency in how you did it when you present your results.

If you are removing two outliers from 10,000, I don't think it particularly matters either way. If you are removing two outliers from 10 records, it becomes significantly more important!

In general, if you are using a 2SD method, I would say you should remove both of them at the same time - you set the exclusion criteria, and then you remove everything that doesn't fit it. It does not seem to me that you have any analytic justification for the other approach - why would you remove one and then recalculate?
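A minimal sketch of the single-pass version of this rule, in Python: the mean and standard deviation are computed once on the full data, and every point outside the band is excluded in the same step. The function name `split_by_sd_rule` and the sample values are hypothetical and serve only as illustration.

```python
import statistics


def split_by_sd_rule(values, k=2.0):
    """Apply a k-SD exclusion rule in one pass.

    The centre and spread are estimated once from all values, so the
    criterion is fixed before anything is removed.
    """
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    kept = [v for v in values if abs(v - center) <= k * spread]
    removed = [v for v in values if abs(v - center) > k * spread]
    return kept, removed


# Hypothetical sample: ten ordinary readings plus two extreme ones.
values = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 5.1, 4.9, 9.5, 0.4]
kept, removed = split_by_sd_rule(values)
print("removed:", removed)
print("kept:", kept)
```

Here both extreme values fall outside the 2 SD band computed from the full sample, so they are removed together rather than one at a time.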

With that said - if the outlying data points don't have extreme leverage on your model, or are generally unobtrusive, do you think it's necessary to even remove them? I usually suggest not dropping observations unless they are severely disruptive to modeling. HTH!

How to Define the Multiplier Range for Variance Test-Based Outlier Detection?

To find the outliers, you cannot use the distance of an observation to a model through a rule such as:

$$\frac{|\hat{\mu}-x_i|}{\hat{\sigma}},\;i=1,\ldots,n$$

if your estimates of $(\hat{\mu},\hat{\sigma})$ are the classical ones (the usual mean/standard deviation) because the fitting procedure you use to obtain them is itself liable to being pulled towards the outliers (this is called the masking effect).

One simple way to detect outliers reliably, however, is to keep the general idea you suggested (distance from the fit) but replace the classical estimators with robust ones that are much less susceptible to being swayed by outliers. Below I present a general illustration of the idea. If you give more information about your specific problem, I can amend my answer to address the particulars of your situation.

An illustration: consider the following 20 observations drawn from a $\mathcal{N}(0,1)$ (rounded to the second digit):

x<-c(-2.21,-1.84,-.95,-.91,-.36,-.19,-.11,-.1,.18,
.3,.31,.43,.51,.64,.67,.72,1.22,1.35,8.1,17.6)

(the last two really ought to be .81 and 1.76 but have been accidentally mistyped).

Using an outlier detection rule based on comparing the statistic

$$\frac{|x_i-\text{ave}(x_i)|}{\text{sd}(x_i)}$$

to the quantiles of a normal distribution would never lead you to suspect that 8.1 is an outlier: only 17.6 stands out, so you would estimate the $\text{sd}$ of the 'trimmed' series to be about 2 (for comparison, the raw, i.e. untrimmed, estimate of $\text{sd}$ is 4.35).

Had you used a robust statistic instead:

$$\frac{|x_i-\text{med}(x_i)|}{\text{mad}(x_i)}$$

and compared the resulting robust $z$-scores to the chosen quantiles of a candidate distribution (typically the standard normal, if you can assume the $x_i$'s to be symmetrically distributed), you would have correctly flagged the last two observations as outliers (and correctly estimated the $\text{sd}$ of the trimmed series to be 0.96).
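The comparison can be reproduced with a short sketch, written here in Python using only the standard library. One assumption is labeled in the code: the MAD is rescaled by 1.4826 so that it is comparable to a standard deviation under normality, which is the default convention of R's `mad()`. The function names are illustrative.

```python
import statistics

# The 20 observations from the illustration above.
x = [-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18,
     0.30, 0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.10, 17.60]


def classical_scores(values):
    """|x_i - mean| / sd, using the classical estimators."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    return [abs(v - center) / spread for v in values]


def robust_scores(values):
    """|x_i - median| / MAD, using robust estimators."""
    center = statistics.median(values)
    # Assumption: MAD rescaled by 1.4826 (consistency factor for the normal).
    mad = 1.4826 * statistics.median([abs(v - center) for v in values])
    return [abs(v - center) / mad for v in values]


classical = classical_scores(x)
robust = robust_scores(x)
for value, c, r in zip(x, classical, robust):
    print(f"{value:6.2f}  classical z = {c:5.2f}  robust z = {r:5.2f}")
```

With the classical estimators the score of 8.1 is roughly 1.6, well inside any usual cutoff, because the mean and standard deviation have themselves been inflated by the two bad values. With the median and MAD, both 8.1 and 17.6 receive scores above 10.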
