Suppose I do K-fold cross-validation with K=10 folds. There will be one confusion matrix for each fold. When reporting the results, should I report the average of the confusion matrices, or their sum?
Solved – How is the confusion matrix reported from K-fold cross-validation
accuracy, cross-validation, machine learning
Related Question
- Solved – How to perform 10-fold cross validation by manually constructing datasets
- Solved – Reporting variance of the repeated k-fold cross-validation
- Solved – How to report confusion matrix for repeated K-fold cross-validation
- Solved – Neural Networks – Epochs with 10-fold Cross Validation – doing something wrong
- Creating a confusion matrix for every fold from k-fold cross validation in R
- Interpretation of number of observations in confusion matrix after k-fold cross validation
Best Answer
If you are testing the performance of a model (i.e. not optimizing parameters), you will generally sum the confusion matrices. Think of it like this: you have split your data into 10 folds, each serving once as a 'test' set. You train your model on 9/10 of the data, test on the first fold, and get a confusion matrix. This confusion matrix represents the classification of 1/10 of the data. You repeat the analysis with the next 'test' set and get another confusion matrix representing another 1/10 of the data; adding it to the first gives a matrix covering 20% of your data. Continue until you have run all the folds, summing as you go, and the final confusion matrix represents the model's performance on all of the data. You could average the confusion matrices instead, but that doesn't provide any information beyond the cumulative matrix, and the average may be misleading if your folds are not all the same size.
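The summing procedure above can be sketched in Python with scikit-learn. This is a minimal illustration, not the asker's actual pipeline: the synthetic dataset, logistic regression model, and random seeds are all placeholder choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

# Placeholder data: 200 samples, binary labels.
X, y = make_classification(n_samples=200, random_state=0)

# Accumulator for the cumulative confusion matrix.
total_cm = np.zeros((2, 2), dtype=int)

for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    # Sum each fold's confusion matrix; fixing `labels` keeps the
    # matrix layout identical across folds even if a fold lacks a class.
    total_cm += confusion_matrix(y[test_idx], pred, labels=[0, 1])

# Every sample appears in exactly one test fold, so the entries of
# total_cm sum to the full dataset size (200 here).
```

Because each sample lands in exactly one test fold, the summed matrix accounts for every observation exactly once, which is what makes it a single-model performance report rather than an average of ten partial ones.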
Note -- this assumes non-repeated sampling of your data. I'm not completely certain whether the answer changes for repeated sampling; I'll update if I learn something or someone recommends a method.