Solved – RandomForest in R: Bad performance on training set

Tags: r, random forest, regression

I'm trying to use a random forest for regression. However, it does not perform well even on the training set, not to mention the test set.
I'm now wondering whether this is caused by poor-quality input data, or whether there is something I can improve in my approach.

Here is my data and model:

  • n=430
  • 2 continuous input variables
  • 1 categorical input variable
  • 1 continuous output variable

Background:
I'm trying to predict environment-related data from financial data (so it is not guaranteed that there really is a clean relationship in the data!)

> Input 1:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
       60     52154    366902   9754180   2342790 341465729 

> Input 2:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
       21     14043     89800   2600502    561641 108610665 

> Input 3:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5.938e+06 2.924e+09 7.511e+09 1.842e+10 2.198e+10 2.828e+11 

> Output:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       0     8032   282167  2638721  2048726 68796039 
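All four variables are heavily right-skewed (the means sit far above the medians). As a quick check, something like the following sketch works; the column names input1, input2, input3 and output are placeholders for the real columns in analyse:

# Placeholder column names; the real data frame 'analyse' uses different ones
vars <- c("input1", "input2", "input3", "output")

# mean/median ratios far above 1 confirm the long right tails seen in summary()
sapply(analyse[, vars], function(x) mean(x, na.rm = TRUE) / median(x, na.rm = TRUE))

# log1p compresses the tails and copes with the zeros in the output variable
summary(as.data.frame(lapply(analyse[, vars], log1p)))

(Monotone transforms of the predictors don't change the splits a random forest makes, but transforming the output does change what RMSE measures.)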

Model fit (caret package):

library(caret)

# allowParallel and verboseIter are trainControl arguments; ntree is passed
# through train() to randomForest()
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, savePredictions = TRUE,
                        allowParallel = TRUE)

# with the x/y interface, a separate data argument is not needed
fit <- train(x = train_parametres, y = train_result,
             method = "rf",
             trControl = control,
             importance = TRUE,
             ntree = 2000)
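For completeness: allowParallel only takes effect when a parallel backend is registered beforehand, e.g. with doParallel (a minimal sketch; the worker count is arbitrary):

library(doParallel)

# Register a backend so caret's resampling loops can run in parallel
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

# ... run train() as above ...

stopCluster(cl)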

Output (results for "fit" from cross-validation on training set):

Random Forest 

324 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 292, 292, 292, 291, 292, 291, ... 
Resampling results across tuning parameters:

  mtry  RMSE     Rsquared 
  2     4983092  0.5596401
  3     5128162  0.5452369

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 2.

Changing nodesize, ntree, and mtry does not change the results very much.
Is this a problem of data quality, or are there other ways to improve the model that I have overlooked, e.g. data normalization? To my knowledge, it should at least be possible to overfit the model and get better results on the training set.
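By "better results on the training set" I mean something like the following resubstitution check (a sketch; nodesize = 1 deliberately grows deep trees):

library(randomForest)

# Fit directly on the training data, then score the same rows (resubstitution).
# Note: predict() without newdata would return out-of-bag predictions instead.
rf_fit <- randomForest(x = train_parametres, y = train_result,
                       ntree = 2000, nodesize = 1)
pred_train <- predict(rf_fit, newdata = train_parametres)

# In-sample R^2, which I would expect to be much higher than the
# cross-validated Rsquared reported above
1 - sum((train_result - pred_train)^2) /
  sum((train_result - mean(train_result))^2)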

Best Answer

Since you've already explored changing nodesize, ntree and mtry, you're left with two possible explanations for the low R²:

  1. Most of the variation in your data is random, i.e. not explainable by your predictors.
  2. You have insufficient data.

Unfortunately, there's not a lot we can recommend based on the information you've presented. Random forests are reasonably robust to overfitting by construction, because of bagging (but see the side note below), so I wouldn't be surprised if you can't push that R² higher.

Side note: At mtry = 3, you're using all your predictors at every split. Since you have only 3 predictors, that negates one of the ways that random forests work: 'feature bagging', or the 'random subspace method'. This is explained well in this answer.
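If you want to guarantee that feature bagging stays active, you can restrict the tuning grid to mtry < 3 yourself; a minimal sketch against the caret call from the question (the grid values are my choice):

# Only allow mtry values below the total number of predictors, so each
# split still samples a random subset of the features
fit_sub <- train(x = train_parametres, y = train_result,
                 method = "rf",
                 trControl = control,
                 tuneGrid = expand.grid(mtry = 1:2),
                 importance = TRUE,
                 ntree = 2000)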
