I am attempting to run a 5-fold cross validation in R for a logistic regression model followed up by confusion matrices for each "fold." However, my code is only producing one confusion matrix.
forestcov <- c(45, 67, 35, 67, 12, 43, 75, 8, 34, 46)
numspecies <- c(3, 6, 4, 7, 2, 5, 8, 5, 3, 4)
outcome <- as.factor(c('no', 'no', 'yes', 'yes', 'no', 'yes',
                       'no', 'yes', 'yes', 'no'))
df <- data.frame(outcome, forestcov, numspecies)
library(caret)
#partition data
set.seed(123)
index <- createDataPartition(df$outcome, p = .5, list = FALSE, times = 1)
train_df <- df[index,]
test_df <- df[-index,]
#specify training methods
specifications <- trainControl(method = "cv", number = 5,
                               savePredictions = "all",
                               classProbs = TRUE)
#specify logistic regression model
set.seed(1234)
model1 <- train(outcome ~ forestcov + numspecies,
                data = train_df,
                method = "glm",
                family = binomial, trControl = specifications)
#produce confusion matrix
predictions <- predict(model1, newdata = test_df)
confusionMatrix(data = predictions, test_df$outcome)
This produces one matrix. My goal is to run a 5-fold cross validation and produce a matrix for every fold. I can't figure out why I am only getting one matrix. Is the error in my cross validation or the confusion matrix, and how should I go about correcting it?
Best Answer
Your data set is too small, so some of the `glm` models trained on the different folds fail to converge. With random synthetic data, I obtained the following results for $5$-fold CV. To obtain the confusion matrices for the models on the folds, we need to define our own `summaryFunction` and use `returnResamp = 'all'` in the `trainControl`:
It will print the confusion matrix for each fold.
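A minimal sketch of one way to get a matrix per fold. It takes a different route from the custom `summaryFunction` described above: it keeps the held-out predictions with `savePredictions = "final"` and splits `fit$pred` by its `Resample` column. The simulated data frame `sim_df`, its coefficients, and the sample size of 200 are assumptions chosen only so that the `glm` fit on every fold converges; they are not the data used in this answer.

```r
library(caret)

# Simulated data (an assumption for illustration): 200 rows, two predictors
set.seed(123)
n <- 200
sim_df <- data.frame(
  forestcov  = runif(n, 0, 100),
  numspecies = rpois(n, 5)
)
p <- plogis(-2 + 0.03 * sim_df$forestcov + 0.1 * sim_df$numspecies)
sim_df$outcome <- factor(ifelse(runif(n) < p, "yes", "no"),
                         levels = c("no", "yes"))

# 5-fold CV, keeping the held-out predictions of the final model
ctrl <- trainControl(method = "cv", number = 5,
                     savePredictions = "final",
                     classProbs = TRUE)

set.seed(1234)
fit <- train(outcome ~ forestcov + numspecies,
             data = sim_df,
             method = "glm",
             family = binomial,
             trControl = ctrl)

# One confusion matrix per fold, built from that fold's held-out rows
fold_preds <- split(fit$pred, fit$pred$Resample)
fold_cms <- lapply(fold_preds, function(d) {
  confusionMatrix(data = d$pred, reference = d$obs, positive = "yes")
})

for (fold in names(fold_cms)) {
  cat("\n", fold, "\n")
  print(fold_cms[[fold]]$table)
}
```

Each element of `fold_cms` is a full `confusionMatrix` object, so `fold_cms[["Fold1"]]` also carries the per-fold accuracy, sensitivity, and specificity.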
The metrics can be obtained for each fold in the following way:
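As a hedged sketch of where these per-fold metrics live: `train` stores them in the `resample` element of the fitted object, one row per fold. The simulated data (`sim_df`) is again an assumption made for illustration, and the default summary is used here, so the columns are `Accuracy` and `Kappa`.

```r
library(caret)

# Simulated data (an assumption for illustration)
set.seed(123)
n <- 200
sim_df <- data.frame(
  forestcov  = runif(n, 0, 100),
  numspecies = rpois(n, 5)
)
p <- plogis(-2 + 0.03 * sim_df$forestcov + 0.1 * sim_df$numspecies)
sim_df$outcome <- factor(ifelse(runif(n) < p, "yes", "no"),
                         levels = c("no", "yes"))

ctrl <- trainControl(method = "cv", number = 5)

set.seed(1234)
fit <- train(outcome ~ forestcov + numspecies,
             data = sim_df,
             method = "glm",
             family = binomial,
             trControl = ctrl)

# One row per fold: Accuracy, Kappa, and the fold label
print(fit$resample)

# The cross-validated estimate reported by caret is the mean over the folds
print(mean(fit$resample$Accuracy))
print(fit$results$Accuracy)
```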
With `summaryFunction = twoClassSummary` we obtain the following result for each fold:
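A sketch of that setup on simulated data (the data frame `sim_df` and its coefficients are assumptions, not the data behind the results quoted in this answer). `twoClassSummary` needs `classProbs = TRUE` and relies on the `pROC` package being installed; it reports `ROC`, `Sens`, and `Spec` per fold.

```r
library(caret)

# Simulated data (an assumption for illustration)
set.seed(123)
n <- 200
sim_df <- data.frame(
  forestcov  = runif(n, 0, 100),
  numspecies = rpois(n, 5)
)
p <- plogis(-2 + 0.03 * sim_df$forestcov + 0.1 * sim_df$numspecies)
sim_df$outcome <- factor(ifelse(runif(n) < p, "yes", "no"),
                         levels = c("no", "yes"))

# twoClassSummary needs class probabilities
ctrl_roc <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

set.seed(1234)
fit_roc <- train(outcome ~ forestcov + numspecies,
                 data = sim_df,
                 method = "glm",
                 family = binomial,
                 metric = "ROC",
                 trControl = ctrl_roc)

# One row per fold with ROC (AUC), Sens, and Spec.
# Note: twoClassSummary treats the first factor level ("no" here) as the event.
print(fit_roc$resample)
```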
`glm` does not seem to have any tunable hyper-parameters in `caret`; you can use `glmnet` instead and tune the $\mathbb{L}_1$ and $\mathbb{L}_2$ regularization hyper-parameters of the elastic net with CV, obtaining the best model w.r.t. the average CV score on the folds.
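A rough sketch of that tuning step, assuming the `glmnet` package is installed. The grid values for `alpha` (the mix between the ridge and lasso penalties) and `lambda` (the overall penalty strength) are arbitrary choices for illustration, as is the simulated data frame `sim_df`.

```r
library(caret)

# Simulated data (an assumption for illustration)
set.seed(123)
n <- 200
sim_df <- data.frame(
  forestcov  = runif(n, 0, 100),
  numspecies = rpois(n, 5)
)
p <- plogis(-2 + 0.03 * sim_df$forestcov + 0.1 * sim_df$numspecies)
sim_df$outcome <- factor(ifelse(runif(n) < p, "yes", "no"),
                         levels = c("no", "yes"))

ctrl <- trainControl(method = "cv", number = 5)

# Hypothetical grid: alpha = 0 is ridge, alpha = 1 is lasso
net_grid <- expand.grid(alpha  = c(0, 0.5, 1),
                        lambda = 10^seq(-3, 0, length.out = 10))

set.seed(1234)
fit_net <- train(outcome ~ forestcov + numspecies,
                 data = sim_df,
                 method = "glmnet",
                 tuneGrid = net_grid,
                 trControl = ctrl)

# Combination with the best average CV accuracy over the folds
print(fit_net$bestTune)

# Per-fold metrics for the selected combination
print(fit_net$resample)
```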