I recently got confused about how to make correct predictions with random forests. Here is an example:
library(caret)

n <- nrow(iris)
set.seed(44)
train_idx <- sample(1:n, 0.8 * n, replace = FALSE)
traindat <- iris[train_idx, ]
testdat <- iris[-train_idx, ]

control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
tunegrid <- expand.grid(mtry = 1:3)
rf <- train(Sepal.Length ~ ., data = traindat, method = "rf",
            trControl = control, tuneGrid = tunegrid)
So basically I run a simple random forest to predict Sepal.Length. Now
> print(rf)
Random Forest

120 samples
  4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 107, 108, 107, 108, 106, 108, ...
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared   MAE
  1     0.4121244  0.8035960  0.3294848
  2     0.3466976  0.8485902  0.2917225
  3     0.3340895  0.8547870  0.2811055

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 3.
This tells me that the parameter mtry = 3 performed best on the training set, with an RMSE of 0.33. I went on to check this RMSE and tried to calculate it by hand:
> sqrt(mean((predict.train(rf, newdata = traindat, type = "raw") - traindat$Sepal.Length)^2))
[1] 0.1736768
What did I do wrong? Is predict.train the right way to do predictions for random forests?
Glad to hear your advice.
Thank you.
Best Answer
As per the documentation of train and trainControl, there is a resampling / cross-validation process that repeatedly separates your training set into a "sub-training" set, used to build the model, and a held-out "sub-validation" fold, used to evaluate it. With your settings (10-fold CV, repeated 3 times), at each iteration roughly 90% of the rows form the sub-training set and the remaining 10% are predicted. So the RMSE displayed in rf is the RMSE calculated on the held-out folds, using models built on the complementary sub-training sets (hence, distinct data for training and testing).

The final model, however, is refit on all of your training data with the optimal parameters - in your case, mtry = 3. So when you predict traindat with the final rf model and calculate the resulting RMSE, you are not comparing the same things. You get a lower RMSE because the data you are predicting were used to build the model, whereas that was not the case when train evaluated the performance of the candidate models.

If you want to keep the partition of your folds and the corresponding predictions, set the savePredictions parameter in your trainControl to TRUE or "all". You can then access them through the pred element of your train object.
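A minimal sketch of the above, reusing the question's setup. The variable names cv_rmse, test_rmse, and train_rmse are my own; note that pooling all held-out predictions gives a value close to, but not exactly equal to, the printed RMSE, because caret averages the per-resample RMSEs rather than pooling:

```r
library(caret)

set.seed(44)
n <- nrow(iris)
train_idx <- sample(1:n, 0.8 * n, replace = FALSE)
traindat <- iris[train_idx, ]
testdat  <- iris[-train_idx, ]

# Keep the held-out fold predictions made during resampling
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        savePredictions = "final")
rf <- train(Sepal.Length ~ ., data = traindat, method = "rf",
            trControl = control, tuneGrid = expand.grid(mtry = 3))

# RMSE recomputed from the held-out fold predictions (pooled across folds):
# close to the value reported by print(rf)
cv_rmse <- sqrt(mean((rf$pred$pred - rf$pred$obs)^2))

# Honest generalization estimate: predict the untouched test set
test_rmse <- sqrt(mean((predict(rf, newdata = testdat) - testdat$Sepal.Length)^2))

# Resubstitution RMSE on traindat: optimistically low, as in the question,
# because these rows were used to fit the final model
train_rmse <- sqrt(mean((predict(rf, newdata = traindat) - traindat$Sepal.Length)^2))
```

Comparing train_rmse against cv_rmse and test_rmse makes the gap from the question visible: the resubstitution error is much smaller than either of the two estimates computed on data the model did not see.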