I am using Random Forests in Matlab for regression. After training my model on the training data, I want to get the MSE on test data not used in training. I do that in two ways:
- call `predict` and directly calculate the MSE from the predicted and actual values
- call `error` and use the built-in TreeBagger functionality to do the same task
In the first case I get a result about 10 times bigger. Why? The only explanation I have is that the built-in function somehow discounts outliers in the prediction, but I am not sure how exactly that is done.
Can somebody please explain this to me?
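For reference, here is a minimal sketch of the two ways I compute the MSE. The variable names (`Xtrain`, `Ytrain`, `Xtest`, `Ytest`) and the ensemble size are placeholders; note that by default `error` returns a vector of cumulative MSE values, one per ensemble size, so only its last element is comparable to the hand-computed MSE of the full ensemble:

```matlab
% Placeholder data: Xtrain/Ytrain and Xtest/Ytest are assumed to exist
mdl = TreeBagger(100, Xtrain, Ytrain, 'Method', 'regression');

% Way 1: predict on the test set and compute MSE by hand
Yhat = predict(mdl, Xtest);
mse1 = mean((Yhat - Ytest).^2);

% Way 2: the built-in error method on the same test set;
% with the default 'Mode','cumulative' this is a vector of MSE values,
% one element per number of trees grown
mseVec = error(mdl, Xtest, Ytest);
mse2 = mseVec(end);   % MSE of the full ensemble
```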
Best Answer
If you use predict on the training set, the result will be highly overfitted and meaningless -- this applies to all ML algorithms, and it is why test sets, cross-validation, and similar techniques are used.
However, one of the big advantages of bagging is that it can approximate the test error using only the training set -- this is called the out-of-bag (OOB) error. Those OOB predictions (enabled via the `OOBPred` option) are what Matlab uses in `error` to produce the "true" error.
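To make this concrete, here is a sketch of getting the OOB error estimate from TreeBagger. `'OOBPred'` is the older option name (newer releases spell it `'OOBPrediction'`); data variables are placeholders:

```matlab
% Train with OOB predictions enabled so each tree records which
% observations it did NOT see (its out-of-bag set)
mdl = TreeBagger(100, Xtrain, Ytrain, 'Method', 'regression', ...
                 'OOBPred', 'on');

% OOB MSE: every observation is predicted only by the trees that
% were trained without it, which approximates a test-set error
oobMSE = oobError(mdl);   % vector, one element per ensemble size
oobMSE = oobMSE(end);     % OOB MSE of the full ensemble

% The underlying OOB predictions themselves
YhatOOB = oobPredict(mdl);
```

This is why an OOB estimate from the training set can legitimately differ from a hand-computed MSE on a separate test set: the two are computed on different predictions and different observations.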