Solved – Matlab RandomForest prediction error calculation

MATLAB, random forest

I am using random forests in Matlab for regression. After training my model on the training data, I want to get the MSE on test data that was not used in training. I do that in two ways:

  1. call predict and directly calculate MSE using predicted and actual values
  2. call error and use the built-in TreeBagger functionality to do the same task.
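For concreteness, the two approaches might look like the sketch below. The synthetic data, the 400/100 split, the number of trees and all variable names are assumptions made for illustration, not taken from my actual code. One thing worth checking is the 'Mode' option: as far as I understand, error returns one value per number of trees by default, so a single number for the whole forest has to be requested with 'ensemble'.

```matlab
% Hypothetical synthetic regression data; names are illustrative only.
rng(1);                                   % reproducible random numbers
X = rand(500, 5);
Y = 3*X(:,1) + sin(2*pi*X(:,2)) + 0.1*randn(500, 1);

idxTrain = 1:400;                         % rows used to grow the forest
idxTest  = 401:500;                       % held-out rows, never seen in training

B = TreeBagger(100, X(idxTrain,:), Y(idxTrain), 'Method', 'regression');

% Way 1: predict, then compute the mean squared error by hand.
Yhat      = predict(B, X(idxTest,:));
mseManual = mean((Y(idxTest) - Yhat).^2);

% Way 2: let TreeBagger compute the error on the same held-out rows.
% 'Mode','ensemble' asks for a single value for the whole forest; the
% default mode returns a vector with one entry per number of trees.
mseBuiltin = error(B, X(idxTest,:), Y(idxTest), 'Mode', 'ensemble');

fprintf('manual: %.4f   built-in: %.4f\n', mseManual, mseBuiltin);
```

If both calls are given the same rows and the same responses, I would expect the two numbers to be close, so a large gap suggests the two calls are not scoring the same thing.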

In the first case I get a result about 10 times larger. Why? The only explanation I have is that the built-in function somehow discounts outliers in the predictions, but I am not sure exactly how that is done.

Can somebody please explain this to me?

Best Answer

If you use predict on the training set, the resulting error will be heavily overfitted and meaningless. This applies to all ML algorithms, and it is why test sets, cross-validation and similar techniques are used.

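However, one of the big advantages of bagging is that it can approximate, from the training set alone, the error you would get on a test set. This is called the out-of-bag (OOB) estimate: each training row is predicted only by the trees whose bootstrap sample did not contain it. Those predictions (available in OOBPred) are what Matlab uses to produce the "true" error; in TreeBagger this estimate is exposed through oobError, while error scores whatever data you pass to it.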
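A minimal sketch of the difference, assuming synthetic data and illustrative variable names (none of this comes from the question). It grows one forest with out-of-bag bookkeeping switched on and compares the error from predicting the training rows directly with the out-of-bag error, using the oobPredict and oobError methods of TreeBagger.

```matlab
% Hypothetical synthetic regression data; names are illustrative only.
rng(1);                                   % reproducible random numbers
X = rand(400, 5);
Y = 3*X(:,1) + sin(2*pi*X(:,2)) + 0.1*randn(400, 1);

% 'OOBPrediction','on' records which rows were out of bag for each tree.
B = TreeBagger(100, X, Y, 'Method', 'regression', 'OOBPrediction', 'on');

% Resubstitution error: predict the very rows the trees were grown on.
mseResub = mean((Y - predict(B, X)).^2);

% Out-of-bag error by hand: each row is predicted only by trees
% whose bootstrap sample did not include it.
Yoob   = oobPredict(B);
mseOob = mean((Y - Yoob).^2);

% The same estimate from the built-in method, as one value for the forest.
mseOobBuiltin = oobError(B, 'Mode', 'ensemble');

fprintf('resubstitution: %.4f   OOB: %.4f   oobError: %.4f\n', ...
        mseResub, mseOob, mseOobBuiltin);
```

The resubstitution figure is typically much smaller than the out-of-bag one, which is the overfitting effect described above; the out-of-bag value is the one to compare against a genuine held-out test set.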