Random Forest – Is Preprocessing Needed Before Using FinalModel with Caret Package

caretpredictionrrandom forest

I use the caret package for training a randomForest object with 10x10CV.

library(caret)
tc <- trainControl("repeatedcv", number=10, repeats=10, classProbs=TRUE, savePred=T) 
RFFit <- train(Defect ~., data=trainingSet, method="rf", trControl=tc, preProc=c("center", "scale"))

After that, I test the randomForest on a testSet (new data)

RF.testSet$Prediction <- predict(RFFit, newdata=testSet)

The confusion matrix shows me, that the model isn't that bad.

confusionMatrix(data=RF.testSet$Prediction, RF.testSet$Defect)
              Reference
    Prediction   0   1
             0 886 179
             1  53 126  

      Accuracy : 0.8135          
             95% CI : (0.7907, 0.8348)
No Information Rate : 0.7548          
P-Value [Acc > NIR] : 4.369e-07       

              Kappa : 0.4145

I now want to test the $finalModel and I think it should give me the same result, but somehow I receive

> RF.testSet$Prediction <- predict(RFFit$finalModel, newdata=RF.testSet)
>  confusionMatrix(data=RF.testSet$Prediction, RF.testSet$Defect)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 323  66
         1 616 239

               Accuracy : 0.4518          
                 95% CI : (0.4239, 0.4799)
    No Information Rate : 0.7548          
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.0793

What am I missing?

edit @topepo :

I also learned another randomForest without the preProcessed option and got another result:

RFFit2 <- train(Defect ~., data=trainingSet, method="rf", trControl=tc)
testSet$Prediction2 <- predict(RFFit2, newdata=testSet)
confusionMatrix(data=testSet$Prediction2, testSet$Defect)

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 878 174
         1  61 131

               Accuracy : 0.8111          
                 95% CI : (0.7882, 0.8325)
    No Information Rate : 0.7548          
    P-Value [Acc > NIR] : 1.252e-06       

                  Kappa : 0.4167

Best Answer

The difference is the pre-processing. predict.train automatically centers and scales the new data (since you asked for that) while predict.randomForest takes whatever it is given. Since the tree splits are based on the processed values, the predictions will be off.

Max

Related Solutions

Solved – Number of principal components when preprocessing using PCA in caret package in R

By default, caret keeps the components that explain 95% of the variance.
But you can change it by using the thresh parameter.

# Example
preProcess(training, method = "pca", thresh = 0.8)

You can also set a particular number of components by setting the pcaComp parameter.

# Example
preProcess(training, method = "pca", pcaComp = 7)

If you use both parameters, pcaComp has precedence over thresh.

Please see: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/preProcess

Solved – Different results from randomForest via caret and the basic randomForest package

I think the question while somewhat trivial and "programmatic" at first read touches upon two main issues that very important in modern Statistics:

reproducibility of results and
non-deterministic algorithms.

The reason for the different results is that the two procedure are trained using different random seeds. Random forests uses a random subset from the full-dataset's variables as candidates at each split (that's the mtry argument and relates to the random subspace method) as well as bags (bootstrap aggregates) the original dataset to decrease the variance of the model. These two internal random sampling procedures thought are not deterministic between different runs of the algorithm. The random order which the sampling is done is controlled by the random seeds used. If the same seeds were used, one would get the exact same results in both cases where the randomForest routine is called; both internally in caret::train as well as externally when fitting a random forest manually. I attach a simple code snippet to show-case this. Please note that I use a very small number of trees (argument: ntree) to keep training fast, it should be generally much larger.

library(caret)

set.seed(321)
trainData <- twoClassSim(5000, linearVars = 3, noiseVars = 9)
testData  <- twoClassSim(5000, linearVars = 3, noiseVars = 9)

set.seed(432)
mySeeds <- sapply(simplify = FALSE, 1:26, function(u) sample(10^4, 3))
cvCtrl = trainControl(method = "repeatedcv", number = 5, repeats = 5, 
                      classProbs = TRUE, summaryFunction = twoClassSummary, 
                      seeds = mySeeds)

fitRFcaret = train(Class ~ ., data = trainData, trControl = cvCtrl, 
                   ntree = 33, method = "rf", metric="ROC")

set.seed( unlist(tail(mySeeds,1))[1])
fitRFmanual <- randomForest(Class ~ ., data=trainData, 
                            mtry = fitRFcaret$bestTune$mtry, ntree=33)

At this point both the caret.train object fitRFcaret as well as the manually defined randomForest object fitRFmanual have been trained using the same data but importantly using the same random seeds when fitting their final model. As such when we will try to predict using these objects and because we do no preprocessing of our data we will get the same exact answers.

all.equal(current =  as.vector(predict(fitRFcaret, testData)), 
          target = as.vector(predict(fitRFmanual, testData)))
# TRUE

Just to clarify this later point a bit further: predict(xx$finalModel, testData) and predict(xx, testData) will be different if one sets the preProcess option when using train. On the other hand, when using the finalModel directly it is equivalent using the predict function from the model fitted (predict.randomForest here) instead of predict.train; no pre-proessing takes place. Obviously in the scenario outlined in the original question where no pre-processing is done the results will be the same when using the finalModel, the manually fitted randomForest object or the caret.train object.

all.equal(current =  as.vector(predict(fitRFcaret$finalModel, testData)), 
          target = as.vector(predict(fitRFmanual, testData)))
 # TRUE

all.equal(current =  as.vector(predict(fitRFcaret$finalModel, testData)),
          target = as.vector(predict(fitRFcaret, testData)))
# TRUE

I would strongly suggest that you always set the random seed used by R, MATLAB or any other program used. Otherwise, you cannot check the reproducibility of results (which OK, it might not be the end of the world) nor exclude a bug or external factor affecting the performance of a modelling procedure (which yeah, it kind of sucks). A lot of the leading ML algorithms (eg. gradient boosting, random forests, extreme neural networks) do employ certain internal resampling procedures during their training phases, setting the random seed states prior (or sometimes even within) their training phase can be important.

Best Answer

Related Solutions

Solved – Number of principal components when preprocessing using PCA in caret package in R

Solved – Different results from randomForest via caret and the basic randomForest package

Related Question