Solved – How to avoid random forest overfitting and improve prediction

classification, overfitting, predictive-models, random-forest, unbalanced-classes

I have an input dataset x_train and an output vector y_train:

    > head(x_train)
      Symscore1 Symscore2 exercise3 exerciseduration3                  groupchange
    1         1         0         2                 3 Transitional to Transitional
    2         1         3         5                 3 Transitional to Transitional
    3         1         0         1                 0 Transitional to Transitional
    4         1         0         4                 3 Transitional to Transitional
    5         1         0         1                 0   Transitional to Menopausal
    6         0         0         5                 2   Transitional to Menopausal
      age3 packyears     bmi3 education3
    1   55  0.000000 20.89796 Highschool
    2   49  1.000000 20.20038 Highschool
    3   58  8.928572 30.47797      Basic
    4   51  0.000000 34.13111 Highschool
    5   52  2.357143 23.24380      Basic
    6   62  2.000000 16.76574 University

    > summary(x_train)
       Symscore1        Symscore2        exercise3     exerciseduration3
     Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.000    
     1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000    
     Median :0.0000   Median :0.0000   Median :4.000   Median :3.000    
     Mean   :0.6985   Mean   :0.7276   Mean   :3.612   Mean   :2.503    
     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:5.000   3rd Qu.:3.000    
     Max.   :5.0000   Max.   :5.0000   Max.   :5.000   Max.   :4.000    
                           groupchange        age3         packyears     
     Regular to Regular          : 399   Min.   :45.00   Min.   : 0.000  
     Regular to Transitional     : 211   1st Qu.:49.00   1st Qu.: 0.000  
     Regular to Menopausal       : 211   Median :54.00   Median : 0.000  
     Transitional to Transitional:1033   Mean   :53.68   Mean   : 4.012  
     Transitional to Menopausal  :1016   3rd Qu.:58.00   3rd Qu.: 5.000  
     Menopausal to Menopausal    : 188   Max.   :66.00   Max.   :97.143  
          bmi3            education3  
     Min.   :16.10   Basic     : 360  
     1st Qu.:22.32   Highschool:1225  
     Median :24.77   University:1473  
     Mean   :25.62                    
     3rd Qu.:27.73                    
     Max.   :66.10  

    > dim(x_train)
    [1] 3058    9

    > summary(y_train)
       0    1    2    3    4    5 
    1794  737  299  129   69   30 

Using the R package randomForest, I fit a classification forest:

    library(randomForest)
    rf_c <- randomForest(x = x_train, y = y_train, ntree = 100)  # classification is inferred from y_train being a factor; randomForest() has no type argument

The training error is very low, while the prediction performance is very bad:

    > table(predict(rf_c, newdata = x_train), y_train)
       y_train
           0    1    2    3    4    5
      0 1794   10    6    2    0    0
      1    0  727    0    0    0    0
      2    0    0  293    0    0    0
      3    0    0    0  127    0    0
      4    0    0    0    0   69    0
      5    0    0    0    0    0   30
    > rf_c$confusion
         0   1  2  3 4 5 class.error
    0 1535 221 28  8 2 0   0.1443701
    1  497 182 41 12 3 2   0.7530529
    2  157  77 43 14 8 0   0.8561873
    3   52  42 19 10 5 1   0.9224806
    4   25  13 27  1 3 0   0.9565217
    5    4  13  6  4 3 0   1.0000000

This makes me think that there is an overfitting problem. In theory, random forests should never overfit, so this seems very strange to me.
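
For what it's worth, the two tables come from different prediction modes: calling predict() on a randomForest object without newdata returns the out-of-bag predictions that rf_c$confusion summarizes, while newdata = x_train re-predicts the training set, which a forest almost always gets nearly right. A one-line check:

    # OOB predictions: each case is predicted only by trees that never saw it;
    # this reproduces rf_c$confusion (minus the class.error column)
    table(y_train, predict(rf_c))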

As can be seen from the data, y_train is very unbalanced: the 0 class is much larger than the others. This may be the cause of the poor performance.
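
One way to probe the imbalance hypothesis, assuming y_train is a factor, would be the stratified sampling built into randomForest, drawing an equal number of cases from every class for each tree. A minimal sketch (downsampling this hard, to the 30 cases of the rarest class, trades overall accuracy for minority-class recall):

    # Balanced bootstrap: sample the size of the rarest class from every class
    n_min  <- min(table(y_train))
    rf_bal <- randomForest(x = x_train, y = y_train, ntree = 500,
                           strata = y_train,
                           sampsize = rep(n_min, nlevels(y_train)))
    rf_bal$confusion   # OOB confusion matrix of the balanced forest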

I would like to know how I can improve this model. Any tangential comments are welcome too.

Best Answer

Donbeo, a couple of pointers:

  1. When training a random forest, you need to tune the parameter mtry: the number of features randomly considered as split candidates at each node. Use five- or ten-fold cross-validation for this (a sketch follows this list). The reason mtry influences out-of-sample prediction error is that the larger mtry is, the more correlated the individual trees become, which weakens the variance reduction you get from averaging them.

  2. You should also grow a large enough forest; perhaps 100 trees is not enough. Try growing a bigger forest in addition to optimizing mtry (see the second sketch below). You need not worry about the size of the forest leading to overfitting: the bigger the forest, the better, although with diminishing returns.
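
A minimal sketch of point 1, assuming the caret package for the cross-validation (the answer does not prescribe one; tuneRF() from randomForest itself is a lighter alternative that tunes mtry on the OOB error instead):

    library(caret)

    # 10-fold CV over every possible mtry (x_train has 9 predictors)
    ctrl  <- trainControl(method = "cv", number = 10)
    rf_cv <- train(x = x_train, y = y_train, method = "rf",
                   ntree = 500, tuneGrid = data.frame(mtry = 1:9),
                   trControl = ctrl)
    rf_cv$bestTune   # mtry with the best cross-validated accuracy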
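
And for point 2, the fitted object stores the OOB error after each tree is added (the err.rate matrix), so plotting it shows where the forest is big enough; a sketch:

    # Grow a larger forest and watch the OOB error stabilise
    rf_big <- randomForest(x = x_train, y = y_train, ntree = 1000)
    plot(rf_big)                 # OOB and per-class error vs. number of trees
    rf_big$err.rate[1000, "OOB"] # final OOB error rate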
