Solved – Random forest: overfitting even when OOB error is low

Tags: overfitting, random-forest

Is there any case where the OOB (out-of-bag) error fails to indicate overfitting? For example, the OOB error is still good but the RF is overfitted.

More specifically, I got a low OOB error (8%) on a data set with a lot of wrong labels (i.e. two samples with nearly identical feature values may end up in different classes, and vice versa). The wrong-label rate is around 20% of the data set of 7000 samples. The OOB error was calculated during training and was therefore based on the wrong labels as well. One possibility is that the RF was able to learn a very nonlinear cut between the two classes even though they overlap significantly. But I want to know if there are any other possibilities.
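For reference, here is a minimal R sketch of the kind of setup I mean (the feature distributions, the 20% flip rate, and the default forest settings are placeholders, not my actual data):

library(randomForest)

set.seed(1)
n <- 7000

# two heavily overlapping classes in two features
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(ifelse(x1 + x2 + rnorm(n, sd = 2) > 0, 1, 0))

# corrupt roughly 20% of the labels
flip <- sample(n, size = round(0.2 * n))
y_noisy <- y
y_noisy[flip] <- ifelse(y[flip] == "1", "0", "1")

dat <- data.frame(x1 = x1, x2 = x2, y = y_noisy)
fit <- randomForest(y ~ ., data = dat)
fit  # the printed OOB error is computed against the noisy labels, not the true ones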

Thank you.

Best Answer

Elaborating on what @MichaelM said in the comments:

I know of one fairly common situation in which the OOB error can be extremely misleading: when there are many duplicate rows in your training data. If one copy of a duplicate is in the bag and the other is out, the out-of-bag copy is very likely to be predicted correctly, which makes the model look better than it is.

I recently ran into this problem when evaluating a model which someone had built after up-sampling the data using SMOTE. The OOB error looked fantastic, but the model was almost useless for classifying unseen data.

Here's a simple example in R of the kind of thing that can happen.

library(randomForest)

# completely random data: the features carry no information about y
dat <- data.frame(x1 = rnorm(1000),
                  x2 = rnorm(1000),
                  y  = factor(sample(0:1, 1000, replace = TRUE)))

# build model after "over-sampling" by duplicating every row
dat2  <- rbind(dat, dat)
model <- randomForest(y ~ ., data = dat2)
model  # the printed OOB error is far lower than the ~50% a useless model should show
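
To see how misleading that OOB estimate is, a small follow-up sketch (using a hypothetical fresh test set drawn from the same completely random process) compares it with held-out accuracy:

# genuinely unseen data from the same random process
test <- data.frame(x1 = rnorm(1000),
                   x2 = rnorm(1000),
                   y  = factor(sample(0:1, 1000, replace = TRUE)))

pred <- predict(model, newdata = test)
mean(pred == test$y)  # hovers around 0.5 (chance level), despite the very low OOB error above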