R Programming – Interpreting Observations in Confusion Matrix after K-Fold Cross Validation

Tags: confusion-matrix, cross-validation, r

I completed a 5-fold cross validation and produced a confusion matrix that (I believe) summarizes the results of the 5-fold CV; however, I don't understand why there are only 8 observations in the confusion matrix. My reproducible data set has 10 observations. I did an 80/20 training/testing split, but I thought the confusion matrix would include all 10 observations: 2 from each test fold multiplied by the number of folds (5). Why are there only 8?

forestcov <- c(45, 67, 35, 67, 12, 43, 75, 8, 34, 46)
numspecies <- c(3, 6, 4, 7, 2, 5, 8, 5, 3, 4)
outcome <- as.factor(c('no','no','yes','yes','no','yes',
                       'no', 'yes', 'yes','no'))

df <- data.frame(outcome, forestcov, numspecies)
library(caret)

#partition data
set.seed(123) 
index <- createDataPartition(df$outcome, p = .8, list = FALSE, times = 1) 
train_df <- df[index,] 
test_df <- df[-index,] 

#specify training methods
specifications <- trainControl(method = "cv", number = 5, 
                               savePredictions = "all", 
                               classProbs = TRUE) 

#specify logistic regression model
set.seed(1234) 
model1 <- train(outcome ~ forestcov + numspecies, 
                data=train_df,
                method = "glm",
                family = binomial, trControl = specifications)

#produce confusion matrix
confusionMatrix(model1, norm = "none")

Best Answer

In your code, model1 never sees the "test" data frame at all: you pass it the data in train_df, which contains only 8 samples, and the trControl specification tells caret to run 5-fold cross-validation within that training data. The confusion matrix produced by confusionMatrix(model1, norm = "none") aggregates the held-out predictions across those 5 folds, so it can only ever contain the 8 training observations.

Meanwhile, you define the test data in test_df and then never use those 2 samples for anything at all! You're effectively running 5-fold cross-validation on your 8 "training" samples while holding out 2 samples as an entirely independent test set that plays no part in the cross-validation. There's no reason here to split into train/test sets up front, since that's exactly what the cross-validation specification is for: you should be running the cross-validated training routine on the full data in df.
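For reference, here is a minimal sketch of that fix, reusing the df and specifications objects defined in the question (model_full is just an illustrative name):

#train with 5-fold cross-validation on the full data set
set.seed(1234)
model_full <- train(outcome ~ forestcov + numspecies,
                    data = df,   #all 10 observations, not train_df
                    method = "glm",
                    family = binomial,
                    trControl = specifications)

#the confusion matrix now aggregates the held-out predictions
#from all 5 folds, i.e. all 10 observations
confusionMatrix(model_full, norm = "none")

If you do want a truly independent test set on top of the cross-validation, you would keep the split and then evaluate the fitted model on test_df with predict(), separately from the cross-validated confusion matrix.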