Random Forest Overfitting – How to Identify Overfitting in R

overfitting, r, random forest

I used a two-step cforest in my model. The accuracy on the training set is 87%, yet the accuracy on the test set is 57%. This indicates the model is severely overfitting. How can I solve this problem? Should I reduce the number of nodes in the trees, or divide the data into k folds? How can I determine how many nodes I should retain?

Here is the code for step 1.

   library(party)  # provides cforest() and cforest_unbiased()

   # Step 1: binary response, TRUE when b == 'three'
   fit1 <- cforest((b == 'three') ~ posemo + social + family + friend + home +
                     humans + money + they + article + certain + insight +
                     affect + negemo + future + swear + sad + negate + ppron +
                     sexual + death + filler + leisure,
                   data = trainset1,
                   controls = cforest_unbiased(ntree = 3000, mtry = 3))

Best Answer

The claim that "in random forests, overfitting is generally caused by over growing the trees", as stated in one of the other answers, is completely wrong. The RF algorithm, as originally defined, uses fully grown, unpruned trees. This is because averaging can only reduce variance, not bias (where, for squared-error loss, $error = bias^2 + variance + noise$). Since the bias of the entire forest is roughly equal to the bias of a single tree, the base model has to be a very deep tree to guarantee low bias. Variance is then reduced by growing many deep, decorrelated trees and averaging their predictions.
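To make the bias/variance argument concrete, here is the standard decomposition written out, assuming a regression setting with squared-error loss. The notation is introduced here for illustration and is not part of the original answer: $\hat f$ is the fitted model, $\sigma_\varepsilon^2$ the irreducible noise, $T_b$ the $b$-th tree, $B$ the number of trees, $\sigma^2$ the variance of a single tree, and $\rho$ the pairwise correlation between trees (assumed identical for every pair).

$$\mathrm{Err}(x) = \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat f(x)\big) + \mathrm{Var}\big(\hat f(x)\big)$$

$$\mathrm{Var}\Big(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\Big) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

Under these assumptions, the second term shrinks as $B$ grows while the first does not, which is why lowering the correlation between trees (through bootstrapping and random feature selection) matters as much as adding trees. Averaging leaves the bias term untouched, hence the need for deep, low-bias trees.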

I wouldn't necessarily say that a training accuracy of 87% and a test accuracy of 57% indicate severe overfitting on their own. Performance on the training set will almost always be higher than on the test set, and for a forest of deep trees the training accuracy is especially optimistic. Now, you need to provide more information if you want Cross Validated (CV) users to be able to diagnose the source of your potential overfitting problem:

  • How did you tune the parameters of your random forest model? Did you use cross-validation, or an independent test set? What are the sizes of your training/testing sets? Did you properly use randomization to constitute these sets?
  • Is your target categorical or continuous? If it is categorical, do you have any kind of class imbalance issue?
  • How did you measure error? If it applies, is your classification problem binary or multiclass?
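Two of these checks can be sketched in base R: a randomized split that preserves class proportions, and a majority-class baseline that shows what a given accuracy is worth under class imbalance. The data frame `dat`, its columns, the 70/30 ratio, and the 25% positive rate are all made up for illustration; they are not taken from the question.

```r
set.seed(42)  # reproducible data and split

# Hypothetical data: a binary target with roughly 25% positives
n <- 400
dat <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  y  = factor(sample(c("other", "three"), n, replace = TRUE,
                     prob = c(0.75, 0.25)))
)

# Stratified random split: draw 70% of the row indices within each class
train_idx <- unlist(lapply(split(seq_len(n), dat$y), function(rows) {
  rows[sample.int(length(rows), size = floor(0.7 * length(rows)))]
}))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Class proportions should be nearly identical in both sets
prop_train <- prop.table(table(train$y))
prop_test  <- prop.table(table(test$y))
print(rbind(train = prop_train, test = prop_test))

# Majority-class baseline: always predict the most frequent training class
majority     <- names(which.max(table(train$y)))
baseline_acc <- mean(test$y == majority)
print(baseline_acc)
```

A test accuracy only means something relative to `baseline_acc`: with a strongly imbalanced target, a classifier that learned nothing can still look accurate, so a confusion matrix or per-class error is usually more informative than accuracy alone.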

In practice, random forests seldom overfit. What would tend to favor overfitting, if anything, is having too many trees in the forest. At some point it is no longer necessary to keep adding trees (doing so no longer reduces variance, and can slightly increase it). This is why the number of trees should be tuned like any other hyperparameter, or at least should not be carelessly set to too high a number: it should be the smallest number of trees needed to achieve the lowest error. You can look for the plateau in the curve of OOB (out-of-bag) error versus number of trees.
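The OOB curve can be inspected along these lines. This is a minimal sketch that swaps in the randomForest package instead of party's cforest, because a classification forest from randomForest stores the OOB error after each added tree in `err.rate`. The iris data, the 105-row training split, the 500 trees, and the 0.005 tolerance are arbitrary choices for illustration.

```r
library(randomForest)  # assumption: installed; not the party package from the question

set.seed(123)  # reproducible split and forest

# Hold out 30% of iris as a test set (illustrative data, not the asker's)
idx   <- sample(nrow(iris), size = 105)
train <- iris[idx, ]
test  <- iris[-idx, ]

rf <- randomForest(Species ~ ., data = train, ntree = 500)

# err.rate has one row per forest size; the "OOB" column is the overall OOB error
oob_error <- rf$err.rate[, "OOB"]
plot(seq_along(oob_error), oob_error, type = "l",
     xlab = "Number of trees", ylab = "OOB error")

# Smallest forest whose OOB error is within 0.005 of the minimum
# (the tolerance is an arbitrary choice for illustration)
ntree_small <- min(which(oob_error <= min(oob_error) + 0.005))

# Accuracy on the rows used for fitting is optimistic;
# OOB and held-out accuracy are fairer estimates
acc <- c(train = mean(predict(rf, newdata = train) == train$Species),
         oob   = mean(predict(rf) == train$Species),
         test  = mean(predict(rf, newdata = test) == test$Species))
print(round(acc, 3))
```

Calling `predict(rf)` without `newdata` returns the OOB predictions, so the `oob` entry is the number to compare against test accuracy. A large gap between `train` and `test` with a small gap between `oob` and `test` points to ordinary resubstitution optimism, not to a broken model.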

Other than overfitting, the difference in accuracy between the training and test sets could be explained by differences between the sets themselves. Are the same concepts present in both sets? If not, even the best classifier won't be able to perform well out of sample. You can't extrapolate to something if you never learned about it.
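One rough way to check whether the two sets differ is to compare each predictor's distribution across them, for example with a two-sample Kolmogorov-Smirnov test from base R. In the sketch below the data are simulated: the column names are borrowed from the question's formula, `posemo` is deliberately shifted in the test set, and the 0.01 threshold is an arbitrary choice.

```r
set.seed(1)  # reproducible simulated data

# Hypothetical train/test sets; 'posemo' is shifted in the test set on purpose
train <- data.frame(posemo = rnorm(300, mean = 0), negemo = rnorm(300))
test  <- data.frame(posemo = rnorm(150, mean = 1), negemo = rnorm(150))

# Two-sample Kolmogorov-Smirnov test for each predictor
ks_p <- sapply(names(train), function(v) {
  ks.test(train[[v]], test[[v]])$p.value
})

# Predictors whose distribution differs markedly between the two sets
shifted <- names(ks_p)[ks_p < 0.01]
print(ks_p)
print(shifted)
```

With many predictors, some small p-values will appear by chance, so this is a screening step, not a formal test. It also assumes continuous predictors; counts or categorical variables would need a different comparison, such as a table of proportions.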

I would also recommend that you read the chapter on RF in the seminal The Elements of Statistical Learning. In particular, see section 15.3.4 (p. 596) on RF and overfitting.