Solved – RandomForest in R: Bad performance on training set

Tags: r, random forest, regression

I'm trying to use a random forest for regression. However, it does not perform well even on the training set, not to mention the test set.
I'm now wondering whether this is caused by poor-quality input data, or whether there is something I can improve in my approach.

Here is my data and model:

  • n=430
  • 2 continuous input variables
  • 1 categorical input variable
  • 1 continuous output variable

Background:
I'm trying to predict environment-related data from financial data (so it is not guaranteed that there really is a clean relationship in the data!)

> Input 1:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
       60     52154    366902   9754180   2342790 341465729 

> Input 2:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
       21     14043     89800   2600502    561641 108610665 

> Input 3:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5.938e+06 2.924e+09 7.511e+09 1.842e+10 2.198e+10 2.828e+11 

> Output:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       0     8032   282167  2638721  2048726 68796039 
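All four variables are heavily right-skewed (the means sit far above the medians). As a quick check, something like the following sketch works; the column names input1, input2, input3 and output are placeholders for the real columns in analyse:

# Placeholder column names; the real data frame 'analyse' uses different ones
vars <- c("input1", "input2", "input3", "output")

# mean/median ratios far above 1 confirm the long right tails seen in summary()
sapply(analyse[, vars], function(x) mean(x, na.rm = TRUE) / median(x, na.rm = TRUE))

# log1p compresses the tails and copes with the zeros in the output variable
summary(as.data.frame(lapply(analyse[, vars], log1p)))

(Monotone transforms of the predictors don't change the splits a random forest makes, but transforming the output does change what RMSE measures.)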

Model fit (caret package):

library(caret)

# allowParallel and verboseIter are trainControl arguments; ntree is passed
# through train() to randomForest()
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, savePredictions = TRUE,
                        allowParallel = TRUE)

# with the x/y interface, a separate data argument is not needed
fit <- train(x = train_parametres, y = train_result,
             method = "rf",
             trControl = control,
             importance = TRUE,
             ntree = 2000)
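For completeness: allowParallel only takes effect when a parallel backend is registered beforehand, e.g. with doParallel (a minimal sketch; the worker count is arbitrary):

library(doParallel)

# Register a backend so caret's resampling loops can run in parallel
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

# ... run train() as above ...

stopCluster(cl)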

Output (results for "fit" from cross-validation on training set):

Random Forest 

324 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 292, 292, 292, 291, 292, 291, ... 
Resampling results across tuning parameters:

  mtry  RMSE     Rsquared 
  2     4983092  0.5596401
  3     5128162  0.5452369

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 2.

Changing nodesize, ntree, and mtry does not change the results very much.
Is this a problem of data quality, or are there other ways to improve the model that I have overlooked, e.g. data normalization? To my knowledge, it should at least be possible to overfit the model and get better results on the training set.
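By "better results on the training set" I mean something like the following resubstitution check (a sketch; nodesize = 1 deliberately grows deep trees):

library(randomForest)

# Fit directly on the training data, then score the same rows (resubstitution).
# Note: predict() without newdata would return out-of-bag predictions instead.
rf_fit <- randomForest(x = train_parametres, y = train_result,
                       ntree = 2000, nodesize = 1)
pred_train <- predict(rf_fit, newdata = train_parametres)

# In-sample R^2, which I would expect to be much higher than the
# cross-validated Rsquared reported above
1 - sum((train_result - pred_train)^2) /
  sum((train_result - mean(train_result))^2)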

Best Answer

Since you've already explored changing nodesize, ntree and mtry, you're left with two possible explanations for the low R²:

  1. Most of the variation in your data is random, i.e. not explainable by your predictors.
  2. You have insufficient data.

Unfortunately, there's not a lot we can recommend based on the information you've presented. Random forests are reasonably robust to overfitting by construction, because of bagging (but see the side note below), so I wouldn't be surprised if you can't push that R² higher.

Side note: At mtry = 3, you're using all your predictors at every split. Since you have only 3 predictors, that negates one of the ways that random forests work: 'feature bagging', or the 'random subspace method'. This is explained well in this answer.
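If you want to guarantee that feature bagging stays active, you can restrict the tuning grid to mtry < 3 yourself; a minimal sketch against the caret call from the question (the grid values are my choice):

# Only allow mtry values below the total number of predictors, so each
# split still samples a random subset of the features
fit_sub <- train(x = train_parametres, y = train_result,
                 method = "rf",
                 trControl = control,
                 tuneGrid = expand.grid(mtry = 1:2),
                 importance = TRUE,
                 ntree = 2000)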
