Solved – R caret Naive Bayes (untuned) results differ from klaR

caret, e1071, machine learning, r

I'm running a naive Bayes classification model, and I noticed that the caret package returns a different result than klaR (which caret wraps for method = "nb") or e1071.

My question is: is there something wrong with my caret syntax that prevents me from recovering the same results as with klaR (or e1071)?

Please note that I am running an untuned model in caret and giving it the same specification that I'm giving to klaR (usekernel = FALSE and fL = 0).

In the example below, both e1071 and klaR return identical confusion matrices, which makes sense given that klaR is based on e1071 but adds the kernel density option and the Laplace smoother, both of which I've disabled here. Oddly, when caret is asked to run an untuned model with the same specification as the klaR model, the results are close but not identical; I would expect them to be identical to klaR's.

Here's a reproducible example:

# Load Libraries
library(kernlab)  # for the spam data
library(caret)
library(e1071)
library(klaR)

# Load Data
data(spam)

# e1071 naiveBayes
set.seed(3456)
fit1 <- naiveBayes(spam, spam$type, type = "raw")
pred1 <- predict(fit1, spam, type = "class")
confusionMatrix(pred1, spam$type)

# klaR NaiveBayes
set.seed(3456)
fit2 <- NaiveBayes(spam, spam$type, usekernel = FALSE, fL = 0)
pred2 <- predict(fit2, spam)
# Warnings that probability is 0 for some cases
confusionMatrix(pred2$class, spam$type)

# caret with no tuning, usekernel = FALSE, fL = 0
set.seed(3456)
fit3 <- train(type ~ .,
              data = spam,
              method = "nb",
              trControl = trainControl(method = "none"),
              tuneGrid = data.frame(fL = 0, usekernel = FALSE))

pred3 <- predict(fit3, spam, type = "raw")
# Warnings that probability is 0 for some cases
confusionMatrix(pred3, spam$type)

Here are selected outputs from the confusion matrices.

For e1071:

Accuracy : 0.7266
Sensitivity : 0.5814          
Specificity : 0.9498         

For klaR:

Accuracy : 0.7266
Sensitivity : 0.5814          
Specificity : 0.9498 

For caret:

Accuracy : 0.7135
Sensitivity : 0.5610          
Specificity : 0.9482 

Any information on why this is happening and what, if anything, I can do about it, is greatly appreciated.

Thanks!

EDIT: In case it's helpful, here is the output of sessionInfo():

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] klaR_0.6-12     MASS_7.3-45     e1071_1.6-7     caret_6.0-58    ggplot2_1.0.1   lattice_0.20-33 kernlab_0.9-22 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1        magrittr_1.5       splines_3.2.2      munsell_0.4.2      colorspace_1.2-6   foreach_1.4.3     
 [7] minqa_1.2.4        stringr_1.0.0      car_2.1-0          plyr_1.8.3         tools_3.2.2        parallel_3.2.2    
[13] nnet_7.3-11        pbkrtest_0.4-2     grid_3.2.2         gtable_0.1.2       nlme_3.1-122       mgcv_1.8-7        
[19] quantreg_5.19      class_7.3-14       MatrixModels_0.4-1 iterators_1.0.8    lme4_1.1-10        digest_0.6.8      
[25] Matrix_1.2-2       nloptr_1.0.4       reshape2_1.4.1     codetools_0.2-14   stringi_1.0-1      scales_0.3.0      
[31] combinat_0.0-8     stats4_3.2.2       SparseM_1.7        proto_0.3-10  

Best Answer

The problem is that you use a different specification across the models: in fit1 and fit2 you use the x/y combination, while in fit3 you use the formula notation.

If you switch all models to the formula notation (type ~ ., data = spam), you will see an accuracy of 0.7135.

If you switch all models to the x/y notation (spam, spam$type), you will see an accuracy of 0.7266.
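
For example, here is a minimal sketch of the two switched calls (fit2f and fit3xy are hypothetical names; caret's default train(x, y) method is assumed to accept the full spam data frame as x, mirroring your original x/y calls):

# klaR with the formula interface (should match caret's fit3)
set.seed(3456)
fit2f <- NaiveBayes(type ~ ., data = spam, usekernel = FALSE, fL = 0)
pred2f <- predict(fit2f, spam)
confusionMatrix(pred2f$class, spam$type)

# caret with the x/y interface (should match fit1 and fit2)
set.seed(3456)
fit3xy <- train(x = spam, y = spam$type,
                method = "nb",
                trControl = trainControl(method = "none"),
                tuneGrid = data.frame(fL = 0, usekernel = FALSE))
confusionMatrix(predict(fit3xy, spam), spam$type)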

There is probably someone who can explain exactly why this difference occurs. I have no idea, except that it has something to do with the difference in how the S3 formula notation is processed versus the default x/y notation.
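
One concrete difference worth noting: the x/y calls pass the entire spam data frame, including the type column itself, as the predictor set, whereas type ~ . drops the response from the right-hand side. A quick way to check which columns each fit actually used is to inspect the per-predictor tables e1071 stores on the model object (fit1f is a hypothetical name for a formula-based refit):

# The x/y fit used every column of spam, including the response itself
"type" %in% names(fit1$tables)    # TRUE

# A formula fit excludes the response from the predictors
fit1f <- naiveBayes(type ~ ., data = spam)
"type" %in% names(fit1f$tables)   # FALSE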