Solved – Correct way of evaluating Random Forest performance wrt training/test, feature selection, ntrees, random seed

accuracy, feature selection, model-evaluation, partitioning, random forest

I need to use Random Forest in my experiments. Even though I use the same training and test datasets, each time I train the Random Forest on my training set I get a different result on my test set. I know that this variation is due to the randomness in the algorithm, but how can I reduce it? What is the correct way of reporting the results if, for example, I need to compare two different feature sets? Should I run the model multiple times on the same training/test split and report the average test result?

My other related question concerns the forest size. My understanding of Random Forest parameter tuning was that a larger number of trees is better: it will not hurt accuracy, it just stops yielding much improvement beyond a certain size. However, when I tried this parameter I noticed that the accuracy decreases after n = 400. Can someone explain why? Is my understanding correct?

Below you can find the results of my experiment. I tried different forest sizes (n), and for each forest size I trained and evaluated the RF 10 times, printed the test-set accuracy for each run, and at the end reported the mean and standard deviation of the accuracy for that n.

n: 50
0: 0.585
1: 0.61
2: 0.588333333333
3: 0.606666666667
4: 0.598333333333
5: 0.595
6: 0.598333333333
7: 0.578333333333
8: 0.586666666667
9: 0.59
n: 50 mean: 0.593666666667 std: 0.00939266853574
-------------------------------
n: 100
0: 0.601666666667
1: 0.59
2: 0.586666666667
3: 0.603333333333
4: 0.6
5: 0.593333333333
6: 0.568333333333
7: 0.6
8: 0.596666666667
9: 0.591666666667
n: 100 mean: 0.593166666667 std: 0.00975961064797
-------------------------------
n: 200
0: 0.595
1: 0.595
2: 0.591666666667
3: 0.58
4: 0.606666666667
5: 0.625
6: 0.596666666667
7: 0.603333333333
8: 0.605
9: 0.61
n: 200 mean: 0.600833333333 std: 0.0115289490703
-------------------------------
n: 300
0: 0.608333333333
1: 0.596666666667
2: 0.605
3: 0.605
4: 0.593333333333
5: 0.613333333333
6: 0.611666666667
7: 0.595
8: 0.595
9: 0.616666666667
n: 300 mean: 0.604 std: 0.00810349718743
-------------------------------
n: 400
0: 0.601666666667
1: 0.601666666667
2: 0.608333333333
3: 0.608333333333
4: 0.606666666667
5: 0.601666666667
6: 0.606666666667
7: 0.598333333333
8: 0.591666666667
9: 0.606666666667
n: 400 mean: 0.603166666667 std: 0.00502493781056
-------------------------------
n: 500
0: 0.61
1: 0.598333333333
2: 0.6
3: 0.608333333333
4: 0.608333333333
5: 0.613333333333
6: 0.595
7: 0.6
8: 0.59
9: 0.598333333333
n: 500 mean: 0.602166666667 std: 0.00707303172464
-------------------------------
n: 600
0: 0.605
1: 0.596666666667
2: 0.61
3: 0.603333333333
4: 0.596666666667
5: 0.588333333333
6: 0.598333333333
7: 0.588333333333
8: 0.6
9: 0.6
n: 600 mean: 0.598666666667 std: 0.0064463598686
-------------------------------
n: 700
0: 0.593333333333
1: 0.605
2: 0.595
3: 0.596666666667
4: 0.603333333333
5: 0.61
6: 0.598333333333
7: 0.601666666667
8: 0.6
9: 0.601666666667
n: 700 mean: 0.6005 std: 0.00471699056603
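
The question does not show the code that produced these numbers, but a loop of this kind might look roughly like the sketch below, assuming scikit-learn in Python; the synthetic dataset from make_classification and the forest sizes are placeholders standing in for the real (unshown) data.

    # Sketch of the experiment loop described above, assuming scikit-learn.
    # make_classification stands in for the real dataset, which is not shown.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    for n in [50, 100, 200, 300, 400, 500, 600, 700]:
        accuracies = []
        for run in range(10):
            rf = RandomForestClassifier(n_estimators=n)  # no seed fixed, so each run differs
            rf.fit(X_train, y_train)
            accuracies.append(rf.score(X_test, y_test))
            print("%d: %s" % (run, accuracies[-1]))
        print("n: %d mean: %s std: %s" % (n, np.mean(accuracies), np.std(accuracies)))
        print("-------------------------------")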

Please let me know the correct way of evaluating the performance of a Random Forest.

Best Answer

In order to obtain reproducible randomizations you have to set a random seed beforehand; e.g. in R you would call set.seed(17), where 17 is just a number I made up. Once you do that, you should get the same accuracy on every random forest run for a fixed number of trees.
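
If the experiment was run with scikit-learn in Python (a guess based on the output format above), the equivalent is fixing random_state; a minimal sketch, with synthetic data standing in for the question's training/test split:

    # A minimal sketch of fixing the seed in scikit-learn; the synthetic data
    # stands in for the question's (unshown) training/test split.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    rf = RandomForestClassifier(n_estimators=400, random_state=17)  # 17 is arbitrary
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))  # identical every time this script is rerun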

I don't think your accuracy really decreases when you increase the number of trees. The accuracy results with few trees have higher variance, and a mean value that is slightly higher than another (in your case the differences are quite small) does not necessarily mean that setting is better. Your results might be due to a different randomization of the data; try setting a random seed beforehand.

If you don't have a specific set of records to use as a test set, you might obtain more stable results by computing the accuracy with cross-validation. 10-fold cross-validation is usually the measure of choice. For a better estimate you can also average the results of several cross-validations with different randomizations. This can be time-consuming; averaging over 5 randomizations of 2-fold cross-validation is good practice if you don't have too few records. If you have only a few records, for example fewer than 100, you can try the leave-one-out method.
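
As a sketch of what that might look like with scikit-learn (the make_classification dataset below is a placeholder, not data from the question):

    # A sketch of 10-fold and repeated (5x2) cross-validation with scikit-learn;
    # make_classification stands in for the real dataset.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, RepeatedKFold

    X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
    rf = RandomForestClassifier(n_estimators=400, random_state=17)

    # Plain 10-fold cross-validation (stratified by default for classifiers).
    scores_10 = cross_val_score(rf, X, y, cv=10)
    print("10-fold CV: %.3f +/- %.3f" % (scores_10.mean(), scores_10.std()))

    # Average over 5 randomizations of 2-fold cross-validation (5x2 CV).
    cv_5x2 = RepeatedKFold(n_splits=2, n_repeats=5, random_state=17)
    scores_5x2 = cross_val_score(rf, X, y, cv=cv_5x2)
    print("5x2 CV: %.3f +/- %.3f" % (scores_5x2.mean(), scores_5x2.std()))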

If you want to learn more about classifier evaluation, the book Evaluating Learning Algorithms: A Classification Perspective by Nathalie Japkowicz and Mohak Shah is a good reference.