I ran a 5-fold cross-validation and produced a confusion matrix that (I believe) summarizes its results; however, I don't understand why there are only 8 observations in the confusion matrix. My reproducible data set has 10 observations, and I did an 80/20 training/testing split. I thought the confusion matrix would include all 10 observations: 2 from each fold's test set multiplied by the number of folds (5). Why are there only 8?
forestcov <- c(45, 67, 35, 67, 12, 43, 75, 8, 34, 46)
numspecies <- c(3, 6, 4, 7, 2, 5, 8, 5, 3, 4)
outcome <- as.factor(c('no', 'no', 'yes', 'yes', 'no', 'yes',
                       'no', 'yes', 'yes', 'no'))
df <- data.frame(outcome, forestcov, numspecies)
library(caret)
#partition data
set.seed(123)
index <- createDataPartition(df$outcome, p = .8, list = FALSE, times = 1)
train_df <- df[index,]
test_df <- df[-index,]
#specify training methods
specifications <- trainControl(method = "cv", number = 5,
                               savePredictions = "all",
                               classProbs = TRUE)
#specify logistic regression model
set.seed(1234)
model1 <- train(outcome ~ forestcov + numspecies,
                data = train_df,
                method = "glm",
                family = binomial,
                trControl = specifications)
#produce confusion matrix
confusionMatrix(model1, norm = "none")
Best Answer
In your code, model1 never sees the test data at all: you pass train() only the 8 samples in train_df, and the trControl specification runs 5-fold cross-validation within that training data. The confusion matrix aggregates the held-out predictions from those 5 internal folds, so it covers exactly the 8 training observations. The 2 samples you set aside in test_df are never used for anything. There's no reason to split into train/test sets up front here, because the cross-validation itself provides held-out evaluation; if you want the confusion matrix to cover all 10 observations, run the cross-validated training routine on the full data in df.
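A minimal sketch of that fix, reusing the data and settings from the question (model_full is a new name I've introduced; everything else is unchanged):

```r
library(caret)

# Same reproducible data as in the question
forestcov  <- c(45, 67, 35, 67, 12, 43, 75, 8, 34, 46)
numspecies <- c(3, 6, 4, 7, 2, 5, 8, 5, 3, 4)
outcome    <- as.factor(c('no', 'no', 'yes', 'yes', 'no', 'yes',
                          'no', 'yes', 'yes', 'no'))
df <- data.frame(outcome, forestcov, numspecies)

# Same CV specification; each of the 10 rows is held out in exactly one fold
specifications <- trainControl(method = "cv", number = 5,
                               savePredictions = "all",
                               classProbs = TRUE)

set.seed(1234)
model_full <- train(outcome ~ forestcov + numspecies,
                    data = df,          # full data, no up-front split
                    method = "glm",
                    family = binomial,
                    trControl = specifications)

# Counts in the cross-validated confusion matrix now sum to 10
confusionMatrix(model_full, norm = "none")
```

With no separate hold-out set, every observation appears once as a held-out prediction, so the table's cells sum to the full sample size.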