Solved – Random Forest: more trees gave worse results

I'm new to ML and decided to start learning by having a go at digit recognition using the MNIST subset on Kaggle: http://www.kaggle.com/c/digit-recognizer

The images are 28×28-pixel greyscale images of the digits 0 to 9, with ~34000 in the training set and 28000 in the test set. A random forest on the raw pixels with 2000 trees gives a score of 0.96829.

I did some preprocessing of the images to extract more features, then trained and tested on equal-sized subsets of the training data. Compared with a plain random forest on the raw pixels, for subsets of around 1000 to 3000 images I was getting about a 3-4% improvement (e.g. 94% vs 90% for RF).

After training on the entire training set with 200 trees, I scored 0.97557 (a ~0.7% improvement over the raw-pixel RF). However, after increasing to 2000 trees I scored 0.97457 (0.1% lower).

  • Does this mean the features are 'saturated', i.e. any minor performance gain or loss is just random fluctuation?
  • Is there anything I can try (apart from more features) to improve the result?
  • Is there any way to weight some features more than others so they are more likely to be evaluated?

Best Answer

It probably means that you are overcorrelating your ensemble. Random forest works because it is built from highly independent trees, which it gets by randomizing both the sample each tree is built on and the candidate variables for each split of each tree. By Jensen's inequality, the error of the averaged prediction of your tree ensemble is always less than or equal to the average error of the individual trees. Note that the loss function must be convex for this to hold, and that the averaging only helps much when the "guesses" are independent. If you build too many trees, there comes a point where this is no longer true.
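To make the averaging argument concrete, here is a minimal toy sketch, not a real random forest. Each hypothetical "tree" predicts the target plus Gaussian noise, part of which is shared by all trees (that shared part is what makes them correlated). The function name `simulate`, the noise levels and the squared-error loss are all assumptions chosen for illustration. Under these assumptions the error of the averaged prediction never exceeds the average individual error, but the gap shrinks a lot once the noise is mostly shared.

```python
import random

def simulate(n_trees, n_samples, shared_sd, own_sd, seed=0):
    """Toy model: return (error of the averaged prediction,
    average error of the individual predictions) under squared-error loss."""
    rng = random.Random(seed)
    err_of_avg = 0.0
    avg_of_err = 0.0
    for _ in range(n_samples):
        target = rng.uniform(0.0, 1.0)
        # Noise common to every "tree"; this is what correlates their errors.
        shared = rng.gauss(0.0, shared_sd)
        # Each "tree" adds its own independent noise on top.
        preds = [target + shared + rng.gauss(0.0, own_sd) for _ in range(n_trees)]
        mean_pred = sum(preds) / n_trees
        err_of_avg += (mean_pred - target) ** 2
        avg_of_err += sum((p - target) ** 2 for p in preds) / n_trees
    return err_of_avg / n_samples, avg_of_err / n_samples

# Both settings give each "tree" roughly the same total noise variance (about 1.0).
independent = simulate(n_trees=50, n_samples=2000, shared_sd=0.0, own_sd=1.0)
correlated = simulate(n_trees=50, n_samples=2000, shared_sd=0.9, own_sd=0.436)

print("independent: ensemble error %.3f, mean tree error %.3f" % independent)
print("correlated:  ensemble error %.3f, mean tree error %.3f" % correlated)
```

In this toy setup the individual "trees" are about equally bad in both runs, yet the averaged prediction is far better when their errors are independent.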

It could also be, as you mentioned, down to the random nature of the random forest model, since the trees are built on bootstrap samples.
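One rough way to judge whether a 0.001 drop is just noise is to compare it with the sampling error of the score itself. The sketch below treats each test image as an independent right/wrong trial and assumes the score is computed on all 28000 test images quoted in the question. Both are simplifying assumptions, and the helper name `accuracy_standard_error` is made up for this example.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples cases."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

score_200 = 0.97557    # 200 trees, from the question
score_2000 = 0.97457   # 2000 trees, from the question
n_test = 28000         # test-set size quoted in the question (assumed fully scored)

se = accuracy_standard_error(score_200, n_test)
gap = score_200 - score_2000
gap_in_se = gap / se

print(f"standard error ~ {se:.5f}, gap = {gap:.5f}, gap / SE = {gap_in_se:.2f}")
```

Under these assumptions the gap is only about one standard error, so it is hard to tell apart from random fluctuation. If the score were computed on only part of the test set, the noise would be larger still.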
