Solved – Response-distribution-dependent bias in random forest regression

Tags: r, random-forest, regression

I am using the randomForest package in R (R version 2.13.1, randomForest version 4.6-2) for regression and noticed a significant bias in my results: the prediction error depends on the value of the response variable. High values are under-predicted and low values are over-predicted. At first I suspected this was a consequence of my data, but the following simple example suggests that it is inherent to the random forest algorithm:

library(randomForest)
set.seed(1)
n = 1000
x1 = rnorm(n, mean = 0, sd = 1)
response = x1
predictors = data.frame(x1 = x1)
rf = randomForest(x = predictors, y = response)
error = response - predict(rf, predictors)
plot(x1, error)

I suspect the bias depends on the distribution of the response: for example, if x1 is uniformly distributed, there is no bias; if x1 is exponentially distributed, the bias is one-sided. Essentially, the values of the response at the tails of a normal distribution are outliers, and it is no surprise that a model would have difficulty predicting outliers. In the case of randomForest, a response value of extreme magnitude from the tail of a distribution is less likely to end up in a terminal leaf, and its effect will be washed out in the ensemble average.
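To illustrate the distribution dependence, here is a sketch along the lines of the example above (variable names are my own; the plots are meant to be inspected by eye, not formally tested):

```r
library(randomForest)
set.seed(42)
n = 1000

# Uniform response: errors show no systematic trend in x1
x1 = runif(n)
rf.unif = randomForest(x = data.frame(x1 = x1), y = x1)
plot(x1, x1 - predict(rf.unif, data.frame(x1 = x1)),
     main = "uniform response")

# Exponential response: only the long right tail is shrunk toward
# the mean, so the bias is one-sided
x1 = rexp(n)
rf.exp = randomForest(x = data.frame(x1 = x1), y = x1)
plot(x1, x1 - predict(rf.exp, data.frame(x1 = x1)),
     main = "exponential response")
```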

Note that I tried to capture this effect in a previous question, "RandomForest in R linear regression tails mtry", but that was a bad example. If the bias in the above example is truly inherent to the algorithm, it follows that a bias correction could be formulated given the response distribution one is trying to predict, resulting in more accurate predictions.

Are tree-based methods, such as random forest, subject to response distribution bias? If so, is this previously known to the statistics community and how is it usually corrected (e.g. a second model that uses the residuals of the biased model as input)?

Correcting a response-dependent bias is difficult because, by nature, the response is not known. Unfortunately, the estimated/predicted response often does not share the same relationship with the bias.
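To make the "second model on the residuals" idea concrete, here is a rough sketch (not something I have validated; note the bias model must use the *predicted* response, since the true response is unknown at prediction time, and fitting the correction on training data will itself overfit -- held-out data would be better):

```r
library(randomForest)
set.seed(1)
n = 1000
x1 = rnorm(n)
response = x1
predictors = data.frame(x1 = x1)

rf = randomForest(x = predictors, y = response)
pred = predict(rf, predictors)

# Second model: residual (bias) as a function of the predicted response
bias.fit = lm(I(response - pred) ~ pred)
corrected = pred + predict(bias.fit, data.frame(pred = pred))

plot(x1, response - corrected)  # residuals after correction
```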

Best Answer

It is exactly as you suspect -- the fact that leaf nodes contain means over some set of objects makes any regression tree model tighten the response distribution and makes extrapolation impossible. The ensemble of course does not help with that, and in fact makes the situation worse.

The naive solution (dangerous because of overfitting) is to wrap the model in some kind of classical regression that rescales the response to its desired distribution.
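For concreteness, the rescaling wrapper might look like this (a sketch using the asker's simulated data; as noted, recalibrating on the training set is prone to overfitting):

```r
library(randomForest)
set.seed(1)
n = 1000
x1 = rnorm(n)
response = x1

rf = randomForest(x = data.frame(x1 = x1), y = response)
raw = predict(rf, data.frame(x1 = x1))

# Linear recalibration: stretch the compressed forest predictions
# back toward the observed response distribution
recal = lm(response ~ raw)
rescaled = predict(recal, data.frame(raw = raw))
```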

The better solution is one of the model-in-leaf tree models, for instance MOB in the party package. The idea here is that partitioning of the feature space should end not when the problem is reduced to a single value (as in a regular tree) but when it is reduced to a simple relation (say, linear) between the response and some predictors. Such a relation can then be resolved by fitting a simple model, which won't disturb the distribution or trim extreme values, and which is able to extrapolate.
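A rough sketch of the MOB interface, from memory of the party documentation (the partitioning variable z and the simulated relation are made up for illustration; the formula syntax is `response ~ leaf-model regressors | partitioning variables`):

```r
library(party)
set.seed(1)
n = 1000
x1 = rnorm(n)
z = factor(sample(c("a", "b"), n, replace = TRUE))  # hypothetical partitioning variable
y = ifelse(z == "a", 2 * x1, -x1)  # a different linear relation in each group
d = data.frame(y = y, x1 = x1, z = z)

# Partition on z, fit a linear model of y on x1 within each leaf
m = mob(y ~ x1 | z, data = d, model = linearModel)
predict(m, d)  # leaf-level linear fits can extrapolate beyond seen y values
```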