Solved – Correct way of evaluating Random Forest performance wrt training/test, feature selection, ntrees, random seed

accuracy, feature selection, model-evaluation, partitioning, random forest

I need to use Random Forest in my experiments. Even though I use the same training and test datasets, each time I train the Random Forest on my training set I get a different result on my test set. I know that this variation is due to the randomness in the algorithm, but how can I reduce it? What is the correct way of reporting the results if, for example, I need to compare two different feature sets? Should I run the model multiple times on the same training/test split and report the average test result?

My other related question concerns the forest size. My understanding of Random Forest parameter tuning was that a larger number of trees is better: it will not hurt accuracy, it just stops yielding much improvement beyond a certain size. However, when I tried this parameter I noticed that the accuracy decreases after n = 400. Can someone explain why? Is my understanding correct?

Below you can find the results of my experiment. I tried different forest sizes (n), and for each forest size I trained and evaluated the RF 10 times, printed the test-set accuracy for each run, and at the end reported the mean and standard deviation of the accuracy for that n.

n: 50
0: 0.585
1: 0.61
2: 0.588333333333
3: 0.606666666667
4: 0.598333333333
5: 0.595
6: 0.598333333333
7: 0.578333333333
8: 0.586666666667
9: 0.59
n: 50 mean: 0.593666666667 std: 0.00939266853574
-------------------------------
n: 100
0: 0.601666666667
1: 0.59
2: 0.586666666667
3: 0.603333333333
4: 0.6
5: 0.593333333333
6: 0.568333333333
7: 0.6
8: 0.596666666667
9: 0.591666666667
n: 100 mean: 0.593166666667 std: 0.00975961064797
-------------------------------
n: 200
0: 0.595
1: 0.595
2: 0.591666666667
3: 0.58
4: 0.606666666667
5: 0.625
6: 0.596666666667
7: 0.603333333333
8: 0.605
9: 0.61
n: 200 mean: 0.600833333333 std: 0.0115289490703
-------------------------------
n: 300
0: 0.608333333333
1: 0.596666666667
2: 0.605
3: 0.605
4: 0.593333333333
5: 0.613333333333
6: 0.611666666667
7: 0.595
8: 0.595
9: 0.616666666667
n: 300 mean: 0.604 std: 0.00810349718743
-------------------------------
n: 400
0: 0.601666666667
1: 0.601666666667
2: 0.608333333333
3: 0.608333333333
4: 0.606666666667
5: 0.601666666667
6: 0.606666666667
7: 0.598333333333
8: 0.591666666667
9: 0.606666666667
n: 400 mean: 0.603166666667 std: 0.00502493781056
-------------------------------
n: 500
0: 0.61
1: 0.598333333333
2: 0.6
3: 0.608333333333
4: 0.608333333333
5: 0.613333333333
6: 0.595
7: 0.6
8: 0.59
9: 0.598333333333
n: 500 mean: 0.602166666667 std: 0.00707303172464
-------------------------------
n: 600
0: 0.605
1: 0.596666666667
2: 0.61
3: 0.603333333333
4: 0.596666666667
5: 0.588333333333
6: 0.598333333333
7: 0.588333333333
8: 0.6
9: 0.6
n: 600 mean: 0.598666666667 std: 0.0064463598686
-------------------------------
n: 700
0: 0.593333333333
1: 0.605
2: 0.595
3: 0.596666666667
4: 0.603333333333
5: 0.61
6: 0.598333333333
7: 0.601666666667
8: 0.6
9: 0.601666666667
n: 700 mean: 0.6005 std: 0.00471699056603
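
The question does not show the code that produced these numbers, but a loop of this kind might look roughly like the sketch below, assuming scikit-learn in Python; the synthetic dataset from make_classification and the forest sizes are placeholders standing in for the real (unshown) data.

    # Sketch of the experiment loop described above, assuming scikit-learn.
    # make_classification stands in for the real dataset, which is not shown.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    for n in [50, 100, 200, 300, 400, 500, 600, 700]:
        accuracies = []
        for run in range(10):
            rf = RandomForestClassifier(n_estimators=n)  # no seed fixed, so each run differs
            rf.fit(X_train, y_train)
            accuracies.append(rf.score(X_test, y_test))
            print("%d: %s" % (run, accuracies[-1]))
        print("n: %d mean: %s std: %s" % (n, np.mean(accuracies), np.std(accuracies)))
        print("-------------------------------")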

Please let me know the correct way of evaluating the performance of a Random Forest.

Best Answer

In order to obtain reproducible randomizations you have to set a random seed beforehand; e.g. in R you would call set.seed(17), where 17 is just a number I made up. Once you do that, you should get the same accuracy on every random forest run for a fixed number of trees.
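
If the experiment was run with scikit-learn in Python (a guess based on the output format above), the equivalent is fixing random_state; a minimal sketch, with synthetic data standing in for the question's training/test split:

    # A minimal sketch of fixing the seed in scikit-learn; the synthetic data
    # stands in for the question's (unshown) training/test split.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    rf = RandomForestClassifier(n_estimators=400, random_state=17)  # 17 is arbitrary
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))  # identical every time this script is rerun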

I don't think your accuracy really decreases when you increase the number of trees. The accuracy results with few trees have higher variance, and a mean value that is slightly higher than another (in your case the differences are quite small) does not necessarily mean that setting is better. Your results might be due to a different randomization of the data; try setting a random seed beforehand.

If you don't have a specific set of records to use as a test set, you might obtain more stable results by computing the accuracy with cross-validation. 10-fold cross-validation is usually the measure of choice. For a better estimate you can also average the results of several cross-validations with different randomizations. This can be time-consuming; averaging over 5 randomizations of 2-fold cross-validation is good practice if you don't have too few records. If you have only a few records, for example fewer than 100, you can try the leave-one-out method.
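
As a sketch of what that might look like with scikit-learn (the make_classification dataset below is a placeholder, not data from the question):

    # A sketch of 10-fold and repeated (5x2) cross-validation with scikit-learn;
    # make_classification stands in for the real dataset.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, RepeatedKFold

    X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
    rf = RandomForestClassifier(n_estimators=400, random_state=17)

    # Plain 10-fold cross-validation (stratified by default for classifiers).
    scores_10 = cross_val_score(rf, X, y, cv=10)
    print("10-fold CV: %.3f +/- %.3f" % (scores_10.mean(), scores_10.std()))

    # Average over 5 randomizations of 2-fold cross-validation (5x2 CV).
    cv_5x2 = RepeatedKFold(n_splits=2, n_repeats=5, random_state=17)
    scores_5x2 = cross_val_score(rf, X, y, cv=cv_5x2)
    print("5x2 CV: %.3f +/- %.3f" % (scores_5x2.mean(), scores_5x2.std()))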

If you want to learn more about classifier evaluation, the book Evaluating Learning Algorithms: A Classification Perspective by Nathalie Japkowicz and Mohak Shah is a good reference.