Random Forest – How to Handle Overfitting Effectively

Tags: overfitting, random-forest

I have a computer science background but am trying to teach myself data science by solving problems on the internet.

I have been working on this problem for the last couple of weeks (approx. 900 rows and 10 features). I was initially using logistic regression, but I have now switched to random forests. When I run my random forest model on the training data, I get a very high AUC (> 99%). However, when I run the same model on the test data, the results are not nearly as good (accuracy of approximately 77%). This leads me to believe that I am overfitting the training data.

What are the best practices for preventing overfitting in random forests?

I am using R and RStudio as my development environment. I am using the randomForest package and have accepted the defaults for all parameters.
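
For reference, here is a minimal sketch of that setup, assuming a data frame `dat` with a binary factor outcome `y` and an 80/20 train/test split (the data and variable names are hypothetical, not from the original post):

```r
library(randomForest)
library(pROC)   # for AUC; assumed available

set.seed(42)

## Hypothetical 80/20 split of a data frame `dat` with a
## binary factor outcome `y` and 10 predictors.
idx       <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train_dat <- dat[idx, ]
test_dat  <- dat[-idx, ]

## randomForest with all defaults (ntree = 500;
## mtry = floor(sqrt(p)) for classification).
fit <- randomForest(y ~ ., data = train_dat)

## AUC on the training data vs. the held-out test data.
auc(train_dat$y, predict(fit, train_dat, type = "prob")[, 2])
auc(test_dat$y,  predict(fit, test_dat,  type = "prob")[, 2])
```

One thing worth knowing about this setup: `predict(fit, train_dat)` scores the training rows with every tree in the forest, so the training AUC will look near-perfect almost by construction; the out-of-bag error reported by `print(fit)` is a less optimistic estimate of generalization.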

Best Answer

To avoid overfitting in a random forest, the main thing you need to do is optimize the tuning parameter that governs the number of features randomly chosen to grow each tree from the bootstrapped data (`mtry` in the randomForest package). Typically, you do this via $k$-fold cross-validation, where $k \in \{5, 10\}$, and choose the value that minimizes test-sample prediction error. In addition, growing a larger forest will improve predictive accuracy, although there are usually diminishing returns once you get up to several hundred trees.
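
Concretely, one way to do this in R is with the caret package, which wraps randomForest and cross-validates `mtry` for you. This is an illustrative sketch, not the only reasonable procedure; the grid, seed, and data names (`train_dat`, `y`) are assumptions carried over from the sketch above:

```r
library(caret)          # for cross-validated tuning; assumed available
library(randomForest)

set.seed(42)

## 10-fold cross-validation over a small grid of mtry values
## (the data has 10 features, so mtry can range from 1 to 10).
ctrl <- trainControl(method = "cv", number = 10)
grid <- expand.grid(mtry = c(2, 3, 4, 6, 8, 10))

tuned <- train(y ~ ., data = train_dat,
               method    = "rf",      # randomForest under the hood
               trControl = ctrl,
               tuneGrid  = grid,
               ntree     = 500)       # passed through to randomForest
tuned$bestTune             # the mtry value chosen by CV

## Check the diminishing returns from a larger forest: plot the
## OOB error as a function of the number of trees.
fit <- randomForest(y ~ ., data = train_dat,
                    mtry = tuned$bestTune$mtry, ntree = 1000)
plot(fit)   # the error curve typically flattens after a few hundred trees
```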