Solved – Different results from randomForest via caret and the basic randomForest package

Tags: caret, machine learning, r, random forest, train

I am a bit confused: how can the results of a model trained via caret differ from the model in the original package? I read Whether preprocessing is needed before prediction using FinalModel of RandomForest with caret package? but I do not use any preprocessing here.

I trained several random forests using the caret package, tuning over different mtry values.

> cvCtrl = trainControl(method = "repeatedcv",number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
> newGrid = expand.grid(mtry = c(2,4,8,15))
> classifierRandomForest = train(case_success ~ ., data = train_data, trControl = cvCtrl, method = "rf", metric="ROC", tuneGrid = newGrid)
> curClassifier = classifierRandomForest

I found mtry = 15 to be the best parameter on the training data:

> curClassifier
 ...
Resampling results across tuning parameters:

mtry  ROC    Sens   Spec   ROC SD   Sens SD  Spec SD
 4    0.950  0.768  0.957  0.00413  0.0170   0.00285
 5    0.951  0.778  0.957  0.00364  0.0148   0.00306
 8    0.953  0.792  0.956  0.00395  0.0152   0.00389
10    0.954  0.797  0.955  0.00384  0.0146   0.00369
15    0.956  0.803  0.951  0.00369  0.0155   0.00472

ROC was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 15. 

I assessed the model with an ROC Curve and a confusion matrix:

##ROC-Curve
predRoc = predict(curClassifier, test_data, type = "prob")
myroc = pROC::roc(test_data$case_success, as.vector(predRoc[,2]))
plot(myroc, print.thres = "best")

##adjust optimal cut-off threshold for class probabilities
threshold = coords(myroc,x="best",best.method = "closest.topleft")[[1]] #get optimal cutoff threshold
predCut = factor( ifelse(predRoc[, "Yes"] > threshold, "Yes", "No") )


##Confusion Matrix (Accuracy, Spec, Sens etc.)
curConfusionMatrix = confusionMatrix(predCut, test_data$case_success, positive = "Yes")

The resulting Confusion Matrix and Accuracy:

Confusion Matrix and Statistics
      Reference
Prediction   No  Yes
       No  2757  693
       Yes  375 6684

           Accuracy : 0.8984
 ....

Now I trained a random forest with the same parameters and the same training data using the basic randomForest package:

randomForestManual <- randomForest(case_success ~ ., data = train_data, mtry = 15, ntree = 500, keep.forest = TRUE)
curClassifier = randomForestManual
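
For reference, a minimal sketch of the scoring step that follows; it is simply the assessment code from above, with curClassifier now pointing at the manual forest:

##same assessment code as above, applied to the manually fitted forest
predRoc = predict(curClassifier, test_data, type = "prob")
myroc = pROC::roc(test_data$case_success, as.vector(predRoc[,2]))
threshold = coords(myroc, x = "best", best.method = "closest.topleft")[[1]]
predCut = factor( ifelse(predRoc[, "Yes"] > threshold, "Yes", "No") )
confusionMatrix(predCut, test_data$case_success, positive = "Yes")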

Again I created predictions for the very same test_data and assessed the confusion matrix with the same code as above, but now I got different measures:

Confusion Matrix and Statistics

      Reference
Prediction   No  Yes
       No  2702  897
       Yes  430 6480

           Accuracy : 0.8737 
           ....

What is the reason? What am I missing?

Best Answer

I think the question, while somewhat trivial and "programmatic" at first read, touches upon two main issues that are very important in modern statistics:

  1. reproducibility of results and
  2. non-deterministic algorithms.

The reason for the different results is that the two procedures are trained using different random seeds. Random forests use a random subset of the full dataset's variables as candidates at each split (that is the mtry argument, which relates to the random subspace method) and also bag (bootstrap-aggregate) the original dataset to decrease the variance of the model. These two internal random sampling procedures are not deterministic between different runs of the algorithm; the order in which the sampling is done is controlled by the random seeds used. If the same seeds were used, one would get exactly the same results in both cases where the randomForest routine is called: internally in caret::train as well as externally when fitting a random forest manually. I attach a simple code snippet to showcase this. Please note that I use a very small number of trees (argument: ntree) to keep training fast; it should generally be much larger.

library(caret)
library(randomForest)

set.seed(321)
trainData <- twoClassSim(5000, linearVars = 3, noiseVars = 9)
testData  <- twoClassSim(5000, linearVars = 3, noiseVars = 9)

set.seed(432)
# one seed vector per resample (5 folds x 5 repeats = 25) plus one final
# element for the last model; each vector holds one seed per tuning
# candidate (3 mtry values by default for "rf")
mySeeds <- sapply(simplify = FALSE, 1:26, function(u) sample(10^4, 3))
cvCtrl = trainControl(method = "repeatedcv", number = 5, repeats = 5, 
                      classProbs = TRUE, summaryFunction = twoClassSummary, 
                      seeds = mySeeds)

fitRFcaret = train(Class ~ ., data = trainData, trControl = cvCtrl, 
                   ntree = 33, method = "rf", metric="ROC")

set.seed( unlist(tail(mySeeds,1))[1])
fitRFmanual <- randomForest(Class ~ ., data=trainData, 
                            mtry = fitRFcaret$bestTune$mtry, ntree=33) 

At this point, both the caret train object fitRFcaret and the manually defined randomForest object fitRFmanual have been trained using the same data and, importantly, the same random seeds when fitting their final model. As a result, when we predict using these objects, and because we do no preprocessing of our data, we will get exactly the same answers.

all.equal(current =  as.vector(predict(fitRFcaret, testData)), 
          target = as.vector(predict(fitRFmanual, testData)))
# TRUE

Just to clarify this latter point a bit further: predict(xx$finalModel, testData) and predict(xx, testData) will differ if one sets the preProcess option when using train. When using the finalModel directly, on the other hand, it is equivalent to calling the predict method of the fitted model (predict.randomForest here) instead of predict.train; no pre-processing takes place. Obviously, in the scenario outlined in the original question where no pre-processing is done, the results will be the same whether one uses the finalModel, the manually fitted randomForest object, or the caret train object.

all.equal(current =  as.vector(predict(fitRFcaret$finalModel, testData)), 
          target = as.vector(predict(fitRFmanual, testData)))
 # TRUE

all.equal(current =  as.vector(predict(fitRFcaret$finalModel, testData)),
          target = as.vector(predict(fitRFcaret, testData)))
# TRUE
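
To make that caveat concrete, here is a sketch (not part of the original question) of what happens when preProcess is set: predict.train transforms testData before scoring, while the finalModel receives the raw values, so the two calls generally no longer agree.

## sketch: same setup as before, but now requesting centering and scaling
fitRFpp = train(Class ~ ., data = trainData, trControl = cvCtrl,
                ntree = 33, method = "rf", metric = "ROC",
                preProcess = c("center", "scale"))

all.equal(current = as.vector(predict(fitRFpp, testData)),            # preprocessed internally
          target = as.vector(predict(fitRFpp$finalModel, testData)))  # raw testData
# generally no longer TRUE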

I would strongly suggest that you always set the random seed used by R, MATLAB or any other program. Otherwise, you can neither check the reproducibility of results (which, OK, might not be the end of the world) nor exclude a bug or an external factor affecting the performance of a modelling procedure (which, yeah, kind of sucks). Many of the leading ML algorithms (e.g. gradient boosting, random forests, extreme learning machines) employ internal resampling procedures during their training phase, so setting the random seed state prior to (or sometimes even within) the training phase can be important.
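
As a small illustration of that advice (again just a sketch, reusing the simulated data from above): refitting with the same seed reproduces the forest exactly, while a different seed generally does not.

# same seed -> identical forests; different seed -> (slightly) different forests
set.seed(1); rf1 <- randomForest(Class ~ ., data = trainData, ntree = 33)
set.seed(1); rf2 <- randomForest(Class ~ ., data = trainData, ntree = 33)
set.seed(2); rf3 <- randomForest(Class ~ ., data = trainData, ntree = 33)

all.equal(as.vector(predict(rf1, testData)), as.vector(predict(rf2, testData)))
# TRUE (identical seed, identical bootstrap samples and splits)
isTRUE(all.equal(as.vector(predict(rf1, testData)), as.vector(predict(rf3, testData))))
# usually FALSE (different seed, different bootstrap samples and splits)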