Solved – predictions with random forest in caret

caret, machine learning, random forest

I recently got confused about how to do correct predictions with random forests.
Here is an example:

library(caret)
n <- nrow(iris)
set.seed(44)
train_idx = sample(1:n, 0.8*n, replace = F)

traindat = iris[train_idx,]
testdat = iris[-train_idx,]


control  <- trainControl(method="repeatedcv", number=10, repeats=3)
tunegrid <- expand.grid(mtry = 1:3)

rf <- train(Sepal.Length ~ ., data = traindat, method = "rf", trControl = control, tuneGrid = tunegrid)

So basically I run a simple random forest to predict Sepal.Length. Now

> print(rf)
Random Forest 

120 samples
  4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 107, 108, 107, 108, 106, 108, ... 
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared   MAE      
  1     0.4121244  0.8035960  0.3294848
  2     0.3466976  0.8485902  0.2917225
  3     0.3340895  0.8547870  0.2811055

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 3.

This tells me that mtry = 3 performed best on the training set, with an RMSE of 0.33. I went on to check this RMSE and tried to calculate it by hand:

> sqrt(mean((predict.train(rf, newdata = traindat, type = "raw") - traindat$Sepal.Length)^2))
[1] 0.1736768

What did I do wrong? Is predict.train the right way to do predictions for random forests?

I would be glad to hear your advice.
Thank you.

Best Answer

As per the documentation of train and trainControl, there is a resampling / cross-validation process that repeatedly splits your training set into a "sub-training" set used to build the model and a "sub-validation" set used to evaluate it.

With method="repeatedcv", number=10 and repeats=3, each iteration of the cross-validation uses 9 of the 10 folds (roughly 90% of your 120 training rows, which is why the sample sizes above are 106-108) to build the sub-training model, and the remaining fold (the sub-testing set) is used for evaluation. So the RMSE displayed in rf is the RMSE calculated on the sub-testing sets, based on models built from the sub-training sets (hence, distinct data for training and testing).
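
For intuition, here is a rough sketch of one repeat of such a 10-fold split, built with caret::createFolds on the same traindat as above (not the exact folds train used internally, just an illustration of the fold sizes):

## one repeat of 10-fold CV on the 120 training rows
folds <- createFolds(traindat$Sepal.Length, k = 10)  # held-out row indices per fold
sapply(folds, length)                                # roughly 12 rows held out per fold
nrow(traindat) - sapply(folds, length)               # roughly 108 rows left to fit on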

The final model, however, is refit on all of your training data with the optimal parameter value - in your case, mtry = 3. So when you predict traindat with the final rf model and calculate the resulting RMSE, you are not comparing like with like. You get a lower RMSE because the rows you are predicting were used to build the model, which was not the case when train evaluated the performance of the candidate models.
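
If you want an error estimate based on data the final model has never seen, predict on the testdat split you already created - a quick sketch (this test RMSE will not equal the cross-validation estimate exactly, but it should be in the same ballpark):

## RMSE on the held-out 20% test set, never seen by the final model
test_pred <- predict(rf, newdata = testdat)
sqrt(mean((test_pred - testdat$Sepal.Length)^2))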

If you want to get the partition of your folds and the corresponding held-out predictions, set the savePredictions parameter in your trainControl to TRUE or "all".

control  <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions = T)

You can then access these held-out predictions through your train object, via its pred element.

> head(rf$pred)
        pred obs rowIndex mtry    Resample
  1 5.766016 5.6        7    1 Fold01.Rep1
  2 5.732148 5.7       28    1 Fold01.Rep1
  3 4.939007 4.3       34    1 Fold01.Rep1
  4 5.002672 4.8       48    1 Fold01.Rep1
  5 6.756495 7.7       71    1 Fold01.Rep1
  6 6.354641 6.7       74    1 Fold01.Rep1
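
With those saved predictions you can approximately reconstruct the RMSE that print(rf) reported for the winning tuning value (a rough sketch; caret computes the RMSE per resample and then averages, so this should come out close to 0.33):

## keep only the held-out predictions for the selected mtry
p <- subset(rf$pred, mtry == rf$bestTune$mtry)

## RMSE per resample (fold x repeat), then averaged across the 30 resamples
fold_rmse <- sapply(split(p, p$Resample),
                    function(d) sqrt(mean((d$pred - d$obs)^2)))
mean(fold_rmse)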