Solved – Response-distribution-dependent bias in random forest regression

Tags: r, random-forest, regression

I am using the randomForest package in R (R version 2.13.1, randomForest version 4.6-2) for regression and noticed a significant bias in my results: the prediction error depends on the value of the response variable. High values are under-predicted and low values are over-predicted. At first I suspected this was a consequence of my data, but the following simple example suggests that it is inherent to the random forest algorithm:

library(randomForest)
set.seed(1)
n = 1000
x1 = rnorm(n, mean = 0, sd = 1)
response = x1
predictors = data.frame(x1 = x1)
rf = randomForest(x = predictors, y = response)
error = response - predict(rf, predictors)
plot(x1, error)

I suspect the bias depends on the distribution of the response: for example, if x1 is uniformly distributed, there is no bias; if x1 is exponentially distributed, the bias is one-sided. Essentially, the values of the response at the tails of a normal distribution are outliers, and it is no surprise that a model would have difficulty predicting outliers. In the case of randomForest, a response value of extreme magnitude from the tail of a distribution is less likely to end up in a terminal leaf, and its effect will be washed out in the ensemble average.
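To illustrate the distribution dependence, here is a sketch along the lines of the example above (variable names are my own; the plots are meant to be inspected by eye, not formally tested):

```r
library(randomForest)
set.seed(42)
n = 1000

# Uniform response: errors show no systematic trend in x1
x1 = runif(n)
rf.unif = randomForest(x = data.frame(x1 = x1), y = x1)
plot(x1, x1 - predict(rf.unif, data.frame(x1 = x1)),
     main = "uniform response")

# Exponential response: only the long right tail is shrunk toward
# the mean, so the bias is one-sided
x1 = rexp(n)
rf.exp = randomForest(x = data.frame(x1 = x1), y = x1)
plot(x1, x1 - predict(rf.exp, data.frame(x1 = x1)),
     main = "exponential response")
```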

Note that I tried to capture this effect in a previous question, "RandomForest in R linear regression tails mtry", but that was a bad example. If the bias in the above example is truly inherent to the algorithm, it follows that a bias correction could be formulated given the response distribution one is trying to predict, resulting in more accurate predictions.

Are tree-based methods, such as random forest, subject to response distribution bias? If so, is this previously known to the statistics community and how is it usually corrected (e.g. a second model that uses the residuals of the biased model as input)?

Correcting a response-dependent bias is difficult because, by nature, the response is not known. Unfortunately, the estimated/predicted response often does not share the same relationship with the bias.
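To make the "second model on the residuals" idea concrete, here is a rough sketch (not something I have validated; note the bias model must use the *predicted* response, since the true response is unknown at prediction time, and fitting the correction on training data will itself overfit -- held-out data would be better):

```r
library(randomForest)
set.seed(1)
n = 1000
x1 = rnorm(n)
response = x1
predictors = data.frame(x1 = x1)

rf = randomForest(x = predictors, y = response)
pred = predict(rf, predictors)

# Second model: residual (bias) as a function of the predicted response
bias.fit = lm(I(response - pred) ~ pred)
corrected = pred + predict(bias.fit, data.frame(pred = pred))

plot(x1, response - corrected)  # residuals after correction
```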

Best Answer

It is exactly as you suspect -- the fact that leaf nodes contain means over some set of objects makes any regression tree model tighten the response distribution and makes extrapolation impossible. The ensemble of course does not help with that, and in fact makes the situation worse.

The naive solution (dangerous because of overfitting) is to wrap the model in some kind of classical regression that rescales the response to its desired distribution.
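For concreteness, the rescaling wrapper might look like this (a sketch using the asker's simulated data; as noted, recalibrating on the training set is prone to overfitting):

```r
library(randomForest)
set.seed(1)
n = 1000
x1 = rnorm(n)
response = x1

rf = randomForest(x = data.frame(x1 = x1), y = response)
raw = predict(rf, data.frame(x1 = x1))

# Linear recalibration: stretch the compressed forest predictions
# back toward the observed response distribution
recal = lm(response ~ raw)
rescaled = predict(recal, data.frame(raw = raw))
```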

The better solution is one of the model-in-leaf tree models, for instance MOB in the party package. The idea here is that partitioning of the feature space should end not when the problem is reduced to a single value (as in a regular tree) but when it is reduced to a simple relation (say, linear) between the response and some predictors. Such a relation can then be resolved by fitting a simple model, which won't disturb the distribution or trim extreme values, and which is able to extrapolate.
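A rough sketch of the MOB interface, from memory of the party documentation (the partitioning variable z and the simulated relation are made up for illustration; the formula syntax is `response ~ leaf-model regressors | partitioning variables`):

```r
library(party)
set.seed(1)
n = 1000
x1 = rnorm(n)
z = factor(sample(c("a", "b"), n, replace = TRUE))  # hypothetical partitioning variable
y = ifelse(z == "a", 2 * x1, -x1)  # a different linear relation in each group
d = data.frame(y = y, x1 = x1, z = z)

# Partition on z, fit a linear model of y on x1 within each leaf
m = mob(y ~ x1 | z, data = d, model = linearModel)
predict(m, d)  # leaf-level linear fits can extrapolate beyond seen y values
```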