Welcome to the site; this is a variation of a commonly asked question. You can definitely collapse a multi-class confusion matrix into binary confusion matrices.
Below is some R code that collapses a confusion matrix to a binary one per class. It also calculates Cohen's kappa to get the overall 'rater' agreement between the classifier and the actual class (of cmg).
cmg <- matrix(c(1639, 116, 49, 35, 138, 0, 0, 236,
150, 274, 27, 21, 28, 0, 0, 73,
22, 24, 58, 9, 94, 0, 0, 30,
33, 27, 31, 21, 146, 0, 0, 49,
14, 9, 5, 1, 49, 0, 0, 22,
1, 0, 1, 1, 7, 0, 0, 6,
11, 0, 0, 1, 14, 0, 0, 21,
201, 11, 8, 5, 49, 0, 0, 253),
ncol = 8, dimnames = rep(list(c("T1","T2","T3","T4","T5","T6","T7","T8")), 2))
library(psych)  # provides cohen.kappa()
# Overall agreement
overall_agg <- sum(diag(cmg))/sum(cmg)
# Overall Cohen's Kappa for cmg
unweighted_kappa <- cohen.kappa( cmg, n.obs=sum(cmg) )
# initialise containers
spec_agr_guideline <- list()
collapsed_mat_guideline <- list()
unweighted_kappa_psych <- list()
# loop through all treatments
for (i in seq_len(nrow(cmg))) {
  # Specific (positive) agreement per class
  spec_agr_guideline[[i]] <- 2 * cmg[i, i] / (sum(cmg[i, ]) + sum(cmg[, i]))
  # Collapsed (binary) confusion matrix per class: TP, FN, FP, TN relative to class i
  collapsed_mat_guideline[[i]] <- matrix(c(cmg[i, i], sum(cmg[i, ]) - cmg[i, i],
                                           sum(cmg[, i]) - cmg[i, i],
                                           sum(cmg) - sum(cmg[i, ]) - sum(cmg[, i]) + cmg[i, i]),
                                         ncol = 2)
  # Unweighted Cohen's kappa per collapsed (binary) confusion matrix
  unweighted_kappa_psych[[i]] <- cohen.kappa(collapsed_mat_guideline[[i]],
                                             n.obs = sum(collapsed_mat_guideline[[i]]))
}
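As a quick sanity check of the collapsing step, here is a minimal self-contained sketch with a small 3x3 matrix (the class labels and counts are made up for illustration): the collapsed 2x2 matrix for class i is built from TP, FN, FP, TN and always preserves the total count.

```r
# Toy 3x3 confusion matrix (rows = actual, cols = predicted); counts are invented
cm <- matrix(c(50,  5,  2,
                4, 30,  6,
                1,  3, 20),
             ncol = 3, byrow = TRUE,
             dimnames = rep(list(c("A", "B", "C")), 2))

# Collapse to a 2x2 (binary) matrix for class i: TP, FN, FP, TN
collapse <- function(cm, i) {
  tp <- cm[i, i]
  fn <- sum(cm[i, ]) - tp
  fp <- sum(cm[, i]) - tp
  tn <- sum(cm) - tp - fn - fp
  matrix(c(tp, fn, fp, tn), ncol = 2,
         dimnames = rep(list(c("pos", "neg")), 2))
}

b <- collapse(cm, 1)
# The collapsed matrix always preserves the total number of observations
stopifnot(sum(b) == sum(cm))
```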
Furthermore, you can do some other cool stuff to assess the performance of a multi-class classifier. Some relevant answers from CrossValidated.com are: link1, link2, link3.
My apologies, I just saw how old the question was -- why was it at the top of the list?
Answer (which is as good as it gets with limited information):
Of what kind is the data?
You should probably never use detection accuracy, and certainly not when your classifier outputs a score or probability. How do you classify? The loss function underlying your classification algorithm is usually a good measure to start with when evaluating performance.
I would not lean towards 1-vs-all analytic approaches, such as precision-recall curves. They won't get you very far -- you would have to test each class against all others and then combine these results somehow (harmonic mean, a-priori likelihood of the class being tested, ...?). It is unclear what such combined measures would actually tell you.
If you have probabilistic output, the negative log-likelihood is a good place to start.
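As a sketch of what that looks like in practice (the probability matrix and labels below are invented for illustration): the negative log-likelihood averages -log of the probability the model assigned to the true class of each observation.

```r
# Invented predicted class probabilities for 4 observations over 3 classes
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.3, 0.3, 0.4,
                  0.2, 0.5, 0.3),
                ncol = 3, byrow = TRUE)
truth <- c(1, 2, 3, 2)  # true class index per observation

# Mean negative log-likelihood: -log of the probability given to the true class
nll <- -mean(log(probs[cbind(seq_along(truth), truth)]))
```

Lower is better; a model that puts all its probability mass on the true class every time achieves an NLL of 0.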
If you already have 70% accuracy for class 1, which means 70% of your dataset is class 1, then you might be in a situation where your classifier gives up on some of the smaller classes and instead tries to satisfy a possible regularization term. But this all really depends on your classification scheme. If you want a clearer answer, you need to tell us the whole story. ;)
Best Answer
https://eva.fing.edu.uy/pluginfile.php/69453/mod_resource/content/1/7633-10048-1-PB.pdf I think this will be a good reference for you.
To my understanding, the F-measure is the harmonic mean of precision and recall (2*precision*recall/(precision+recall)). For an unbalanced test set, precision, recall, and F-measure can be confusing and misleading, because random guessing can produce an F-measure above 0.5. MCC, by contrast, will give you a value around 0 in this situation, so I think MCC is the better choice in your case.
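To illustrate the point with a made-up example: suppose random guessing on an imbalanced test set of 100 cases (90 positive, 10 negative) labels a uniformly random half of each group as positive, giving TP = 45, FN = 45, FP = 5, TN = 5. Then F1 sits well above 0.5 while MCC is exactly 0.

```r
# Invented counts from random guessing on a 90/10 imbalanced test set
tp <- 45; fn <- 45; fp <- 5; tn <- 5

precision <- tp / (tp + fp)   # 0.9
recall    <- tp / (tp + fn)   # 0.5
f1        <- 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient
mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

round(c(F1 = f1, MCC = mcc), 3)   # F1 ~ 0.643, MCC = 0
```

The F1 of roughly 0.64 looks respectable even though the classifier has no skill at all, while the MCC of 0 correctly reports chance-level performance.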