R – How to Create a Confusion Matrix for Every Fold from K-Fold Cross Validation

Tags: confusion-matrix, cross-validation, r

I am attempting to run 5-fold cross-validation in R for a logistic regression model, followed by a confusion matrix for each fold. However, my code produces only one confusion matrix.

forestcov <- c(45, 67, 35, 67, 12, 43, 75, 8, 34, 46)
numspecies <- c(3, 6, 4, 7, 2, 5, 8, 5, 3, 4)
outcome <- as.factor(c('no','no','yes','yes','no','yes',
                       'no', 'yes', 'yes','no'))

df <- data.frame(outcome, forestcov, numspecies)
library(caret)

#partition data
set.seed(123) 
index <- createDataPartition(df$outcome, p = .5, list = FALSE, times = 1) 
train_df <- df[index,] 
test_df <- df[-index,] 

#specify training methods
specifications <- trainControl(method = "cv", number = 5, 
                               savePredictions = "all", 
                               classProbs = TRUE) 

#specify logistic regression model
set.seed(1234) 
model1 <- train(outcome ~ forestcov + numspecies, 
                data=train_df,
                method = "glm",
                family = binomial, trControl = specifications)

#produce confusion matrix
predictions <- predict(model1, newdata = test_df)
confusionMatrix(data = predictions, test_df$outcome)

This produces a single matrix. My goal is to run 5-fold cross-validation and produce a confusion matrix for every fold, but I can't figure out why I am only getting one. Is the error in my cross-validation or in the confusion matrix step, and how should I correct it?

Best Answer

Your data set is too small, so some of the glm models trained on the different folds fail to converge. With random synthetic data I obtained the results below for $5$-fold CV. To get a confusion matrix for the model on each fold, we need to define our own summaryFunction and use returnResamp = 'all' in trainControl:

# custom summary function: print the confusion matrix for the current
# resample (fold) and return the per-class statistics to caret
cfm <- function(data, lev = NULL, model = NULL) {
  cm <- confusionMatrix(table(data$pred, data$obs))
  print(cm)
  cm$byClass
}
specifications <- trainControl(method = "cv", number = 5, 
                               savePredictions = "all", 
                               returnResamp = 'all',
                               classProbs = TRUE,                                                                  
                               summaryFunction = cfm) # instead of the default twoClassSummary
set.seed(1234) 
model1 <- train(outcome ~ ., 
                data=train_df,
                method = "glm",
                family = binomial, 
                trControl = specifications)
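The random synthetic data mentioned above is not shown in the answer; a minimal sketch of how such a training set could be generated, reusing the column names from the question (the size and value ranges here are illustrative assumptions), is:

```r
# Hypothetical synthetic training data with the question's column names;
# 100 rows is enough for the per-fold glm fits to converge reliably.
set.seed(42)
n <- 100
train_df <- data.frame(
  outcome    = factor(sample(c("no", "yes"), n, replace = TRUE)),
  forestcov  = round(runif(n, min = 5, max = 80)),
  numspecies = sample(2:8, n, replace = TRUE)
)
str(train_df)
```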

This prints the confusion matrix for each fold:

#for the 1st fold...
#Confusion Matrix and Statistics
#
#     
#      no yes
#  no   2   3
#  yes  3   2
#                                          
#               Accuracy : 0.4             
#                 95% CI : (0.1216, 0.7376)
#    No Information Rate : 0.5             
#    P-Value [Acc > NIR] : 0.8281          
#                                          
#                  Kappa : -0.2            
#                                          
# Mcnemar's Test P-Value : 1.0000          
#                                          
#            Sensitivity : 0.4             
#            Specificity : 0.4             
#         Pos Pred Value : 0.4             
#         Neg Pred Value : 0.4             
#             Prevalence : 0.5              
#         Detection Rate : 0.2             
#   Detection Prevalence : 0.5             
#      Balanced Accuracy : 0.4             
#                                          
#       'Positive' Class : no              
#                                       
#for the 2nd fold...
#Confusion Matrix and Statistics
# ...

The metrics can be obtained for each fold in the following way:

model1$resample
#  Sensitivity Specificity Pos Pred Value Neg Pred Value Precision    Recall        F1 Prevalence Detection Rate Detection Prevalence
#1   0.6274510   0.3469388      0.5000000      0.4722222 0.5000000 0.6274510 0.5565217       0.51           0.32                 0.64
#2   0.5490196   0.4081633      0.4912281      0.4651163 0.4912281 0.5490196 0.5185185       0.51           0.28                 0.57
#3   0.5686275   0.4285714      0.5087719      0.4883721 0.5087719 0.5686275 0.5370370       0.51           0.29                 0.57
#4   0.5098039   0.2857143      0.4262295      0.3589744 0.4262295 0.5098039 0.4642857       0.51           0.26                 0.61
#5   0.7843137   0.2653061      0.5263158      0.5416667 0.5263158 0.7843137 0.6299213       0.51           0.40                 0.76
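Alternatively, because savePredictions = "all" keeps the out-of-fold predictions in model1$pred (with a Resample column identifying the fold), the per-fold confusion matrices can also be rebuilt after training, without a custom summary function. A sketch, assuming model1 is the caret model trained above:

```r
# Rebuild one confusion table per fold from the saved out-of-fold predictions.
fold_tables <- lapply(split(model1$pred, model1$pred$Resample),
                      function(d) table(predicted = d$pred, observed = d$obs))
fold_tables  # a 2x2 table for each of Fold1..Fold5

# For the full statistics, wrap each table in caret::confusionMatrix()
lapply(fold_tables, confusionMatrix)
```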

With summaryFunction = twoClassSummary we obtain the following result for each fold:

#        ROC      Sens      Spec parameter Resample
#1 0.4649860 0.6274510 0.3469388      none    Fold1
#2 0.4697879 0.5490196 0.4081633      none    Fold2
#3 0.4597839 0.5686275 0.4285714      none    Fold3
#4 0.4457783 0.5098039 0.2857143      none    Fold4
#5 0.4709884 0.7843137 0.2653061      none    Fold5

Also, glm does not appear to have any tunable hyper-parameter in caret. If you want tuning, you can use glmnet instead and tune the $\mathbb{L}_1$ and $\mathbb{L}_2$ regularization hyper-parameters of the elastic net with CV, then pick the best model w.r.t. the average CV score across the folds.
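A sketch of such a tuning run, assuming the glmnet package is installed and using an illustrative grid (the alpha/lambda values are not from the original answer):

```r
# Elastic-net tuning via caret: alpha mixes the L1/L2 penalties
# (0 = ridge, 1 = lasso) and lambda sets the overall penalty strength.
library(caret)

set.seed(1234)
model2 <- train(outcome ~ .,
                data = train_df,
                method = "glmnet",
                trControl = trainControl(method = "cv", number = 5,
                                         classProbs = TRUE),
                tuneGrid = expand.grid(alpha  = c(0, 0.5, 1),
                                       lambda = 10^seq(-3, 0, length.out = 5)))

model2$bestTune  # alpha/lambda pair with the best average CV score
```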