Solved – Out-of-bag error and error on test dataset for random forest

overfitting, random forest

Recently I have been working with random forest algorithms because they are easy to use. I always divide my data into train and test subsets. Usually the out-of-bag (OOB) error for a forest built on the train dataset is higher (by more than 10%) than the error on the test dataset. Does this indicate overfitting, or is it natural? Should these two errors be equal? If so, I think I should choose the forest's parameters (such as maximum depth or the maximum number of observations in a terminal node) so that the two errors come out similar.
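The comparison described above can be sketched in scikit-learn, which reports an OOB score directly; the dataset and hyperparameters below are illustrative assumptions, not taken from the question:

```python
# Sketch comparing the out-of-bag (OOB) error estimate with the error on a
# held-out test set. The synthetic dataset and all hyperparameters here are
# illustrative choices, not values from the question.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# oob_score=True makes the forest track predictions on the samples each
# tree did NOT see during bootstrapping.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_error = 1 - rf.oob_score_              # error estimated from OOB samples
test_error = 1 - rf.score(X_test, y_test)  # error on the held-out test set
print(f"OOB error:  {oob_error:.3f}")
print(f"Test error: {test_error:.3f}")
```

With enough trees the two numbers are typically close; a large, systematic gap is worth investigating.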

Best Answer

I understand your question to be (correct me if I'm wrong) that:

  1. You are training a random forest (RF).
  2. You have randomly divided your data into train and test sets.
  3. The measured performance of the RF is obtained through cross-validation on your train set.
  4. You then take the RF produced from your train dataset and look at its performance on your test set.
  5. Sometimes your performance on the test set is better than the average performance obtained through cross-validation on the train set.

The following points are worth noting:

  • The RF you are applying to the test set is trained on more data than the RFs used in cross-validation. Depending on how much data you have, we may expect this first RF to have better performance.
  • The test estimate of performance is based on a single data point, whereas the estimate of performance on your train set is the average of multiple data points. You have no sense of the uncertainty in the test set's estimated performance (at least, not through my understanding of your procedure). You may well be in a situation where the estimated test performance is sometimes higher than the estimated train performance and sometimes lower, without the difference being statistically significant.
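The second point above can be made concrete by repeating the random split: each split yields one noisy estimate of test error, and the spread across splits shows how uncertain a single estimate is (the dataset and settings below are illustrative assumptions):

```python
# Sketch of how much a single train/test split's error estimate varies.
# Repeating the split with different seeds gives a rough sense of the
# uncertainty that one split alone cannot show.
import statistics

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

test_errors = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_tr, y_tr)
    test_errors.append(1 - rf.score(X_te, y_te))

print(f"mean test error:     {statistics.mean(test_errors):.3f}")
print(f"std across 10 splits: {statistics.stdev(test_errors):.3f}")
```

If the gap between your train-set estimate and your test-set estimate is smaller than this spread, it may simply be noise.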