I use the caret package for training a randomForest object with 10x10CV.
library(caret)
tc <- trainControl("repeatedcv", number=10, repeats=10, classProbs=TRUE, savePred=T)
RFFit <- train(Defect ~., data=trainingSet, method="rf", trControl=tc, preProc=c("center", "scale"))
After that, I test the randomForest on a testSet (new data)
RF.testSet$Prediction <- predict(RFFit, newdata=testSet)
The confusion matrix shows me, that the model isn't that bad.
confusionMatrix(data=RF.testSet$Prediction, RF.testSet$Defect)
Reference
Prediction 0 1
0 886 179
1 53 126
Accuracy : 0.8135
95% CI : (0.7907, 0.8348)
No Information Rate : 0.7548
P-Value [Acc > NIR] : 4.369e-07
Kappa : 0.4145
I now want to test the $finalModel and I think it should give me the same result, but somehow I receive
> RF.testSet$Prediction <- predict(RFFit$finalModel, newdata=RF.testSet)
> confusionMatrix(data=RF.testSet$Prediction, RF.testSet$Defect)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 323 66
1 616 239
Accuracy : 0.4518
95% CI : (0.4239, 0.4799)
No Information Rate : 0.7548
P-Value [Acc > NIR] : 1
Kappa : 0.0793
What am I missing?
edit @topepo :
I also learned another randomForest without the preProcessed option and got another result:
RFFit2 <- train(Defect ~., data=trainingSet, method="rf", trControl=tc)
testSet$Prediction2 <- predict(RFFit2, newdata=testSet)
confusionMatrix(data=testSet$Prediction2, testSet$Defect)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 878 174
1 61 131
Accuracy : 0.8111
95% CI : (0.7882, 0.8325)
No Information Rate : 0.7548
P-Value [Acc > NIR] : 1.252e-06
Kappa : 0.4167
Best Answer
The difference is the pre-processing.
predict.train
automatically centers and scales the new data (since you asked for that) whilepredict.randomForest
takes whatever it is given. Since the tree splits are based on the processed values, the predictions will be off.Max