I've noticed that when building random forest regression models, at least in R, the predicted value never exceeds the maximum value of the target variable seen in the training data. As an example, see the code below. I'm building a regression model to predict mpg based on the mtcars data. I fit both an OLS model and a random forest, and use them to predict mpg for a hypothetical car that should have very good fuel economy. The OLS model predicts a high mpg, as expected, but the random forest does not. I've noticed this in more complex models too. Why is this?
> library(datasets)
> library(randomForest)
>
> data(mtcars)
> max(mtcars$mpg)
[1] 33.9
>
> set.seed(2)
> fit1 <- lm(mpg~., data=mtcars) #OLS fit
> fit2 <- randomForest(mpg~., data=mtcars) #random forest fit
>
> #Hypothetical car that should have very high mpg
> hypCar <- data.frame(cyl=4, disp=50, hp=40, drat=5.5, wt=1, qsec=24, vs=1, am=1, gear=4, carb=1)
>
> predict(fit1, hypCar) #OLS predicts higher mpg than max(mtcars$mpg)
1
37.2441
> predict(fit2, hypCar) #RF does not predict higher mpg than max(mtcars$mpg)
1
30.78899
Best Answer
As mentioned in previous answers, random forests for regression (and the regression trees they are built from) do not produce the expected predictions for data points outside the range of the training data, because they cannot extrapolate (well). A regression tree consists of a hierarchy of nodes, where each internal node specifies a test on an attribute value and each leaf (terminal) node specifies a rule for computing the predicted output. In your case the test observation flows through each tree to a leaf node stating, e.g., "if x > 335, then y = 15", and the random forest averages these leaf predictions. Since each leaf value is itself an average of training-set target values, no single tree, and therefore no forest, can predict a value outside the range of the training targets.
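You can check this directly on your own example. A small sketch using the randomForest package's `predict.all` option, which returns the prediction of every individual tree (reusing the `hypCar` data frame defined in the question):

```r
library(randomForest)
library(datasets)

data(mtcars)
set.seed(2)
fit2 <- randomForest(mpg ~ ., data = mtcars)

hypCar <- data.frame(cyl=4, disp=50, hp=40, drat=5.5, wt=1,
                     qsec=24, vs=1, am=1, gear=4, carb=1)

# Predictions of all individual trees for the hypothetical car
treePreds <- predict(fit2, hypCar, predict.all = TRUE)$individual

# Every tree's leaf value lies within the range of the training targets,
# so their average (the forest's prediction) must as well
range(treePreds)
range(mtcars$mpg)
```

Every tree returns a leaf mean that is bounded by min(mtcars$mpg) and max(mtcars$mpg), so the ensemble average is bounded too.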
Here is an R script visualizing the situation with both random forest and linear regression. In random forest's case, predictions are constant for testing data points that are either below the lowest training data x-value or above the highest training data x-value.
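A minimal sketch of such a script, assuming a one-dimensional toy dataset with a linear trend (the variable names and data-generating process here are illustrative, not from the original answer):

```r
library(randomForest)
set.seed(1)

# Training data observed only on a limited x-range [0, 10]
x <- runif(200, 0, 10)
y <- 2 * x + rnorm(200)
train <- data.frame(x = x, y = y)

# Test grid extending well beyond the training range
grid <- data.frame(x = seq(-5, 15, by = 0.1))

fitLm <- lm(y ~ x, data = train)
fitRf <- randomForest(y ~ x, data = train)

# OLS keeps extrapolating linearly; the random forest's predictions
# become constant outside the dashed lines marking the training range
plot(grid$x, predict(fitLm, grid), type = "l", col = "blue",
     xlab = "x", ylab = "predicted y")
lines(grid$x, predict(fitRf, grid), col = "red")
abline(v = range(train$x), lty = 2)
legend("topleft", legend = c("OLS", "random forest"),
       col = c("blue", "red"), lty = 1)
```

The random forest curve flattens at the leaf averages of the outermost training observations, while the OLS line continues along the fitted slope.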