Solved – Random Forest: training error vs oob error

random forest, validation

When creating a random forest model, the training error is different from the out-of-bag error.

The out-of-bag error of the model is based on the predictions:

predict(model)

The training error of the model is based on the predictions:

predict(model,newdata=training_dataset)
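For concreteness, a minimal sketch of the two error rates, assuming the randomForest package and using iris as a stand-in for my actual training_dataset:

library(randomForest)

set.seed(42)
training_dataset <- iris
model <- randomForest(Species ~ ., data = training_dataset)

# Out-of-bag predictions: each row is predicted only by the trees
# that did not see it during training
oob_pred  <- predict(model)
oob_error <- mean(oob_pred != training_dataset$Species)

# Resubstitution predictions: every tree votes on every row,
# including rows it was trained on
train_pred  <- predict(model, newdata = training_dataset)
train_error <- mean(train_pred != training_dataset$Species)

c(oob = oob_error, resubstitution = train_error)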

Now my question is: suppose we create a random forest model and save it, and then we lose the training_dataset… Then one day, we come across that training dataset again. How can we evaluate it?

Or consider a less dramatic situation: the test dataset contains an observation that is identical to one in the training dataset. How should we predict it?
(If we knew it came from the training dataset, we would take the out-of-bag prediction, but here we treat it as a new observation.)

More concretely, if obs_1=training_dataset[1,], we can calculate the prediction in two ways:

predict(model)[1]

or

predict(model,newdata=obs_1)

The results will be different. Which one should we consider? To me, it seems we should choose the first one (because there is a risk of overfitting in the second one).
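A quick sketch of the two predictions, continuing from the iris-based sketch above:

obs_1 <- training_dataset[1, ]

predict(model)[1]                # OOB vote: only the trees that never saw row 1
predict(model, newdata = obs_1)  # full-forest vote: every tree, including those fit on row 1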

Now suppose that in this observation only one variable changes value; call the result obs_1bis (say it is a numerical variable whose initial value is 1, and in the new observation it is 1.001). Then the prediction will be very close to predict(model,newdata=obs_1), but it should be closer to predict(model)[1] if the reasoning above is correct.

EDIT:

If the OOB error is, say, 10%, and the error based on predict(model,newdata=training_dataset) is 0%, should we conclude that the model is heavily overfitted?

Until now, I have only looked at the OOB error, and in the model summary printed by the R package we only see this OOB estimate of the error rate. Using a test dataset, the error rate would then not be far from this estimate (10%).

Then I realized that if an observation comes from the training dataset and we pass it through the newdata argument, its prediction is different, hence my question.
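To make the comparison concrete, a small sketch of OOB error versus hold-out test error, again with iris standing in for my data:

library(randomForest)

set.seed(42)
idx   <- sample(nrow(iris), 100)   # 100 rows for training, the rest held out
train <- iris[idx, ]
test  <- iris[-idx, ]

rf <- randomForest(Species ~ ., data = train)

oob_error  <- mean(predict(rf) != train$Species)                 # OOB estimate
test_error <- mean(predict(rf, newdata = test) != test$Species)  # hold-out estimate

c(oob = oob_error, test = test_error)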

Best Answer

Then one day, we come across that training dataset again. How can we evaluate it?

predict(model,newdata=training_dataset) will work in the usual way. Giving newdata to the model just makes a prediction for each row in newdata.

The object model stores the predictions for the OOB data, so losing the training data won't change the saved model.
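A small sketch of that point, reusing the iris-based model and training_dataset from the sketch in the question (the file name rf_model.rds is just illustrative):

saveRDS(model, "rf_model.rds")   # persist the fitted forest
rm(training_dataset)             # "lose" the training data

model2 <- readRDS("rf_model.rds")
head(model2$predicted)           # the stored OOB predictions, no training data needed

# when the training data turns up again, predict() treats it like any newdata
recovered <- iris                # stand-in for the recovered training set
head(predict(model2, newdata = recovered))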

Or consider a less dramatic situation: the test dataset contains an observation that is identical to one in the training dataset. How should we predict it?

The prediction happens in the usual way: each tree in model casts a vote, and the votes are tallied. The model doesn't care whether it has seen the observation before.

If you're very worried that having the same observations in both the train and test partitions will distort the evaluation, you can work out a stratification scheme that puts all copies of an identical observation entirely in either train or test.
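One possible sketch of such a scheme: give identical feature rows a common group id and assign whole groups to one partition (names here are illustrative, again using iris):

set.seed(42)
feats    <- iris[, 1:4]                                     # features only, for grouping
group_id <- match(do.call(paste, feats), unique(do.call(paste, feats)))

n_groups     <- length(unique(group_id))
train_groups <- sample(unique(group_id), size = floor(0.7 * n_groups))
in_train     <- group_id %in% train_groups

train <- iris[in_train, ]    # every copy of a duplicated row lands on one side
test  <- iris[!in_train, ]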

The results will be different. Which one should we consider?

It depends. OOB prediction is a way to simulate out-of-sample data using the training set. On the other hand, if you want a prediction that uses all of the trees (perhaps because you've deployed the model and need to apply it to new data), then you'd use predict(model, newdata=...).
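You can see the difference in how many trees take part by inspecting components of the fitted object (again the iris-based model from the question's sketch):

model$ntree          # total trees; all of them vote in predict(model, newdata = ...)
model$oob.times[1]   # trees that had row 1 out of bag; only these vote in predict(model)[1]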

To me, it seems we should choose the first one (because there is a risk of overfitting in the second one).

Overfitting is a property of the model itself, not of the mode you choose for making predictions.

A prediction from OOB data might be overfit if the model is overfit. Or it might not be, if the model is not overfit. Either way, choosing predict(model, newdata=...) or predict(model) isn't a toggle that fixes overfitting.

Now suppose that in this observation only one variable changes value; call the result obs_1bis (say it is a numerical variable whose initial value is 1, and in the new observation it is 1.001). Then the prediction will be very close to predict(model,newdata=obs_1), but it should be closer to predict(model)[1] if the reasoning above is correct.

We won't know ahead of time whether a small change to one feature will cause a small or large change in the result, because trees are highly discontinuous. Decision trees find split points in the features -- if the trees split on this feature between 1.0 and 1.001, then the predictions could be very different. If no tree splits in that interval, the predictions will be identical.
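A quick way to see this, reusing obs_1 and the iris-based model from the question's sketches (Sepal.Length is just an arbitrary numeric feature to perturb):

obs_1bis <- obs_1
obs_1bis$Sepal.Length <- obs_1$Sepal.Length + 0.001   # nudge one feature slightly

predict(model, newdata = obs_1, type = "prob")
predict(model, newdata = obs_1bis, type = "prob")     # identical unless some tree splits in between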