I have an input dataset x_train
and an output dataset y_train
> head(x_train)
Symscore1 Symscore2 exercise3 exerciseduration3 groupchange
1 1 0 2 3 Transitional to Transitional
2 1 3 5 3 Transitional to Transitional
3 1 0 1 0 Transitional to Transitional
4 1 0 4 3 Transitional to Transitional
5 1 0 1 0 Transitional to Menopausal
6 0 0 5 2 Transitional to Menopausal
age3 packyears bmi3 education3
1 55 0.000000 20.89796 Highschool
2 49 1.000000 20.20038 Highschool
3 58 8.928572 30.47797 Basic
4 51 0.000000 34.13111 Highschool
5 52 2.357143 23.24380 Basic
6 62 2.000000 16.76574 University
> summary(x_train)
Symscore1 Symscore2 exercise3 exerciseduration3
Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :0.0000 Median :4.000 Median :3.000
Mean :0.6985 Mean :0.7276 Mean :3.612 Mean :2.503
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:5.000 3rd Qu.:3.000
Max. :5.0000 Max. :5.0000 Max. :5.000 Max. :4.000
groupchange age3 packyears
Regular to Regular : 399 Min. :45.00 Min. : 0.000
Regular to Transitional : 211 1st Qu.:49.00 1st Qu.: 0.000
Regular to Menopausal : 211 Median :54.00 Median : 0.000
Transitional to Transitional:1033 Mean :53.68 Mean : 4.012
Transitional to Menopausal :1016 3rd Qu.:58.00 3rd Qu.: 5.000
Menopausal to Menopausal : 188 Max. :66.00 Max. :97.143
bmi3 education3
Min. :16.10 Basic : 360
1st Qu.:22.32 Highschool:1225
Median :24.77 University:1473
Mean :25.62
3rd Qu.:27.73
Max. :66.10
> dim(x_train)
[1] 3058 9
>
> summary(y_train)
0 1 2 3 4 5
1794 737 299 129 69 30
>
Using the R package randomForest, I fit a classification forest:
rf_c <- randomForest(x = x_train, y = y_train, ntree = 100, type = "classification")
The training error is very low, while the prediction performance is very poor:
> table(predict(rf_c, newdata = x_train), y_train)
y_train
0 1 2 3 4 5
0 1794 10 6 2 0 0
1 0 727 0 0 0 0
2 0 0 293 0 0 0
3 0 0 0 127 0 0
4 0 0 0 0 69 0
5 0 0 0 0 0 30
> rf_c$confusion
0 1 2 3 4 5 class.error
0 1535 221 28 8 2 0 0.1443701
1 497 182 41 12 3 2 0.7530529
2 157 77 43 14 8 0 0.8561873
3 52 42 19 10 5 1 0.9224806
4 25 13 27 1 3 0 0.9565217
5 4 13 6 4 3 0 1.0000000
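(For reference, the first table above is built from resubstitution predictions on the training data, while rf_c$confusion is computed from out-of-bag predictions. A minimal sketch, assuming the rf_c, x_train and y_train objects from above: calling predict() on a randomForest fit without newdata returns the OOB predictions directly.)

```r
# OOB predictions: predict() with no newdata returns out-of-bag
# predictions, which is what rf_c$confusion is built from
oob_pred <- predict(rf_c)
table(oob_pred, y_train)

# Resubstitution predictions on the training data (near-perfect)
resub_pred <- predict(rf_c, newdata = x_train)
table(resub_pred, y_train)
```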
>
This makes me think that there is an overfitting problem.
According to theory, a random forest should not overfit, so this seems very strange to me.
As can be seen from the data, y_train is very unbalanced: the 0 class is much larger than the others. This may be the cause of the poor performance.
I would like to know how I can improve this model. Any side comments are also welcome.
Best Answer
Donbeo, a couple of pointers:

1. When training a random forest model, you need to optimize the tuning parameter mtry, the number of features randomly selected at each split. Use five- or ten-fold cross-validation for this. The reason mtry influences out-of-sample prediction error is that the larger mtry is, the more correlated the individual trees become with one another.

2. You should also grow a large enough forest. Perhaps 100 trees is not big enough; try growing a bigger forest in addition to optimizing mtry. You need not worry about the size of the forest leading to overfitting. Actually, the bigger the forest, the better (although there are diminishing returns).
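The two suggestions above can be sketched as follows. This is a sketch, not a definitive recipe: it assumes the x_train and y_train objects from the question, and uses the caret package to run the cross-validation over a small grid of candidate mtry values.

```r
library(randomForest)
library(caret)

set.seed(1)

# 5-fold cross-validation over a grid of mtry values
# (x_train has 9 features, so mtry can range from 1 to 9)
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(1, 2, 3, 5, 7, 9))

rf_cv <- train(x = x_train, y = y_train,
               method    = "rf",
               trControl = ctrl,
               tuneGrid  = grid,
               ntree     = 500)   # a bigger forest than ntree = 100

rf_cv$bestTune   # mtry value with the best cross-validated accuracy
rf_cv$results    # accuracy for each mtry value in the grid
```

caret passes ntree through to randomForest, so the same call also grows the larger forest suggested in the second point.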