I am a bit confused: how can the results of a model trained via caret differ from the same model trained directly with the original package? I read "Whether preprocessing is needed before prediction using FinalModel of RandomForest with caret package?", but I do not use any preprocessing here.
I trained several random forests using the caret package, tuning over different mtry values.
> cvCtrl = trainControl(method = "repeatedcv",number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
> newGrid = expand.grid(mtry = c(2,4,8,15))
> classifierRandomForest = train(case_success ~ ., data = train_data, trControl = cvCtrl, method = "rf", metric="ROC", tuneGrid = newGrid)
> curClassifier = classifierRandomForest
I found mtry = 15 to be the best parameter on the training data:
> curClassifier
...
Resampling results across tuning parameters:

  mtry  ROC    Sens   Spec   ROC SD   Sens SD  Spec SD
   4    0.950  0.768  0.957  0.00413  0.0170   0.00285
   5    0.951  0.778  0.957  0.00364  0.0148   0.00306
   8    0.953  0.792  0.956  0.00395  0.0152   0.00389
  10    0.954  0.797  0.955  0.00384  0.0146   0.00369
  15    0.956  0.803  0.951  0.00369  0.0155   0.00472
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 15.
I assessed the model with an ROC Curve and a confusion matrix:
##ROC-Curve
predRoc = predict(curClassifier, test_data, type = "prob")
myroc = pROC::roc(test_data$case_success, as.vector(predRoc[,2]))
plot(myroc, print.thres = "best")
##adjust optimal cut-off threshold for class probabilities
threshold = coords(myroc,x="best",best.method = "closest.topleft")[[1]] #get optimal cutoff threshold
predCut = factor( ifelse(predRoc[, "Yes"] > threshold, "Yes", "No") )
##Confusion Matrix (Accuracy, Spec, Sens etc.)
curConfusionMatrix = confusionMatrix(predCut, test_data$case_success, positive = "Yes")
The resulting Confusion Matrix and Accuracy:
Confusion Matrix and Statistics
           Reference
Prediction    No   Yes
       No   2757   693
       Yes   375  6684
Accuracy : 0.8984
....
Now I trained a random forest with the same parameters and the same train_data using the basic randomForest package:
randomForestManual <- randomForest(case_success ~ ., data = train_data, mtry = 15, ntree = 500, keep.forest = TRUE)
curClassifier = randomForestManual
Again I created predictions for the very same test_data and assessed the confusion matrix with the same code as above, but now I got different measures:
Confusion Matrix and Statistics
           Reference
Prediction    No   Yes
       No   2702   897
       Yes   430  6480
Accuracy : 0.8737
....
What is the reason? What am I missing?
Best Answer
I think the question, while somewhat trivial and "programmatic" at first read, touches upon two main issues that are very important in modern statistics:
The reason for the different results is that the two procedures are trained using different random seeds. Random forest uses a random subset of the full dataset's variables as candidates at each split (that is the `mtry` argument, which relates to the random subspace method) and also bags (bootstrap aggregates) the original dataset to decrease the variance of the model. These two internal random sampling procedures, though, are not deterministic between different runs of the algorithm. The random order in which the sampling is done is controlled by the random seeds used. If the same seeds were used, one would get exactly the same results in both cases where the `randomForest` routine is called: internally in `caret::train` as well as externally when fitting a random forest manually. I attach a simple code snippet to showcase this. Please note that I use a very small number of trees (argument `ntree`) to keep training fast; it should generally be much larger.

At this point both the `caret::train` object `fitRFcaret` and the manually defined `randomForest` object `fitRFmanual` have been trained using the same data and, importantly, the same random seeds when fitting their final models. As such, when we try to predict using these objects, and because we do no preprocessing of our data, we will get exactly the same answers.

Just to clarify this latter point a bit further:
`predict(xx$finalModel, testData)` and `predict(xx, testData)` will be different if one sets the `preProcess` option when using `train`. On the other hand, when using the `finalModel` directly, it is equivalent to calling the `predict` function of the fitted model (`predict.randomForest` here) instead of `predict.train`; no preprocessing takes place. Obviously, in the scenario outlined in the original question where no preprocessing is done, the results will be the same whether one uses the `finalModel`, the manually fitted `randomForest` object or the `caret::train` object.

I would strongly suggest that you always set the random seed used by R, MATLAB or any other program. Otherwise, you can neither check the reproducibility of results (which, OK, might not be the end of the world) nor exclude a bug or an external factor affecting the performance of a modelling procedure (which, yeah, kind of sucks). Many of the leading ML algorithms (e.g. gradient boosting, random forests, extreme learning machines) employ internal resampling procedures during their training phase, so setting the random seed state prior to (or sometimes even within) the training phase can be important.
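The code snippet referenced in the answer did not survive in this copy; below is a minimal reconstruction of the same experiment. The object names `fitRFcaret` and `fitRFmanual` come from the answer; the iris-based toy data, `ntree = 50`, the tuning grid, and the specific seed values are illustrative assumptions. The key trick is `trainControl(seeds = ...)`: caret sets the last element of that list as the seed for its final model fit, so refitting `randomForest` manually after `set.seed()` with that same value should reproduce caret's final model exactly.

```r
library(caret)          # train(), trainControl()
library(randomForest)   # randomForest()

# Toy two-class data: drop setosa from the built-in iris set
# (an illustrative stand-in for train_data; Species plays the role of case_success)
toy <- droplevels(iris[iris$Species != "setosa", ])

# 1) Same seed, same call => identical forests and identical predictions
set.seed(825)
rf1 <- randomForest(Species ~ ., data = toy, mtry = 2, ntree = 50)
set.seed(825)
rf2 <- randomForest(Species ~ ., data = toy, mtry = 2, ntree = 50)
identical(predict(rf1, toy), predict(rf2, toy))   # TRUE

# 2) Reproducing caret's final model manually: pin caret's seeds via
#    trainControl(seeds = ...); the LAST element seeds the final fit.
set.seed(123)
B <- 10 * 3                                                  # number = 10, repeats = 3
mySeeds <- c(lapply(1:B, function(i) sample.int(10^4, 4)),   # 4 = rows of the tune grid
             sample.int(10^4, 1))                            # seed for the final model
cvCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                       classProbs = TRUE, summaryFunction = twoClassSummary,
                       seeds = mySeeds)
fitRFcaret <- train(Species ~ ., data = toy, method = "rf", metric = "ROC",
                    trControl = cvCtrl, tuneGrid = expand.grid(mtry = 1:4),
                    ntree = 50)   # small ntree only to keep training fast

# Refit manually with the same seed caret used for its final model
set.seed(mySeeds[[B + 1]])
fitRFmanual <- randomForest(Species ~ ., data = toy,
                            mtry = fitRFcaret$bestTune$mtry, ntree = 50)

# With matching seeds and no preprocessing, the predictions should agree
all(predict(fitRFcaret, toy) == predict(fitRFmanual, toy))  # should be TRUE
```

Without the `seeds` argument (or a `set.seed()` call immediately before each fit), the bootstrap samples and per-split variable draws differ between the two runs, which is exactly why the two confusion matrices in the question disagree.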