Welcome to the site; this is a variation of a commonly asked question. You can definitely collapse a multi-class confusion matrix into binary confusion matrices.
Below is some R code that collapses a confusion matrix to a binary one per class. It also calculates Cohen's kappa to get the overall 'rater' agreement between the classifier and the actual class (of cmg).
cmg <- matrix(c(1639, 116, 49, 35, 138, 0, 0, 236,
150, 274, 27, 21, 28, 0, 0, 73,
22, 24, 58, 9, 94, 0, 0, 30,
33, 27, 31, 21, 146, 0, 0, 49,
14, 9, 5, 1, 49, 0, 0, 22,
1, 0, 1, 1, 7, 0, 0, 6,
11, 0, 0, 1, 14, 0, 0, 21,
201, 11, 8, 5, 49, 0, 0, 253),
ncol = 8, dimnames = rep(list(c("T1","T2","T3","T4","T5","T6","T7","T8")), 2))
library(psych)  # provides cohen.kappa()
# Overall agreement
overall_agg <- sum(diag(cmg))/sum(cmg)
# Overall Cohen's Kappa for cmg
unweighted_kappa <- cohen.kappa( cmg, n.obs=sum(cmg) )
# initialise containers
spec_agr_guideline <- list()
collapsed_mat_guideline <- list()
unweighted_kappa_psych <- list()
# loop through all treatments
for (i in seq_len(nrow(cmg))) {
  # Specific (positive) agreement per class
  spec_agr_guideline[[i]] <- 2 * cmg[i, i] / (sum(cmg[i, ]) + sum(cmg[, i]))
  # Collapsed (binary) confusion matrix per class: TP, FN, FP, TN relative to class i
  collapsed_mat_guideline[[i]] <- matrix(c(cmg[i, i], sum(cmg[i, ]) - cmg[i, i],
                                           sum(cmg[, i]) - cmg[i, i],
                                           sum(cmg) - sum(cmg[i, ]) - sum(cmg[, i]) + cmg[i, i]),
                                         ncol = 2)
  # Unweighted Cohen's kappa per collapsed (binary) confusion matrix
  unweighted_kappa_psych[[i]] <- cohen.kappa(collapsed_mat_guideline[[i]],
                                             n.obs = sum(collapsed_mat_guideline[[i]]))
}
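As a quick sanity check of the collapsing step, here is a minimal self-contained sketch with a small 3x3 matrix (the class labels and counts are made up for illustration): the collapsed 2x2 matrix for class i is built from TP, FN, FP, TN and always preserves the total count.

```r
# Toy 3x3 confusion matrix (rows = actual, cols = predicted); counts are invented
cm <- matrix(c(50,  5,  2,
                4, 30,  6,
                1,  3, 20),
             ncol = 3, byrow = TRUE,
             dimnames = rep(list(c("A", "B", "C")), 2))

# Collapse to a 2x2 (binary) matrix for class i: TP, FN, FP, TN
collapse <- function(cm, i) {
  tp <- cm[i, i]
  fn <- sum(cm[i, ]) - tp
  fp <- sum(cm[, i]) - tp
  tn <- sum(cm) - tp - fn - fp
  matrix(c(tp, fn, fp, tn), ncol = 2,
         dimnames = rep(list(c("pos", "neg")), 2))
}

b <- collapse(cm, 1)
# The collapsed matrix always preserves the total number of observations
stopifnot(sum(b) == sum(cm))
```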
Furthermore, you can do some other cool stuff to assess the performance of a multi-class classifier. Some relevant answers from CrossValidated.com are: link1, link2, link3.
My apologies, I just saw how old the question was -- why was it at the top of the list?
Answer (which is as good as it gets with limited information):
Of what kind is the data?
You should probably never use detection accuracy, and certainly not when your classifier outputs a score or probability. How do you classify? The loss function underlying your classification algorithm is usually a good measure to start with when evaluating performance.
I would not lean towards 1-vs-all analytic approaches, such as precision-recall curves. They won't get you very far -- you would have to test each class against all others and then combine these results somehow (harmonic mean, a-priori likelihood of the class being tested, ...?). It is unclear what such combined measures would actually tell you.
If you have probabilistic output, the negative log-likelihood is a good place to start.
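As a sketch of what that looks like in practice (the probability matrix and labels below are invented for illustration): the negative log-likelihood averages -log of the probability the model assigned to the true class of each observation.

```r
# Invented predicted class probabilities for 4 observations over 3 classes
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.3, 0.3, 0.4,
                  0.2, 0.5, 0.3),
                ncol = 3, byrow = TRUE)
truth <- c(1, 2, 3, 2)  # true class index per observation

# Mean negative log-likelihood: -log of the probability given to the true class
nll <- -mean(log(probs[cbind(seq_along(truth), truth)]))
```

Lower is better; a model that puts all its probability mass on the true class every time achieves an NLL of 0.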
If you already have 70% accuracy for class 1, which means 70% of your dataset is class 1, then you might be in a situation where your classifier gives up on some of the smaller classes and instead tries to satisfy a possible regularization term. But this all really depends on your classification scheme. If you want a clearer answer, you need to tell us the whole story. ;)
Best Answer
https://eva.fing.edu.uy/pluginfile.php/69453/mod_resource/content/1/7633-10048-1-PB.pdf I think this will be a good reference for you.
To my understanding, the F-measure is the harmonic mean of precision and recall (2*precision*recall/(precision+recall)). For an unbalanced test set, precision, recall, and F-measure can be confusing and misleading, because random guessing can produce an F-measure above 0.5. MCC, by contrast, will give you a value around 0 in this situation, so I think MCC is the better choice in your case.
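To illustrate the point with a made-up example: suppose random guessing on an imbalanced test set of 100 cases (90 positive, 10 negative) labels a uniformly random half of each group as positive, giving TP = 45, FN = 45, FP = 5, TN = 5. Then F1 sits well above 0.5 while MCC is exactly 0.

```r
# Invented counts from random guessing on a 90/10 imbalanced test set
tp <- 45; fn <- 45; fp <- 5; tn <- 5

precision <- tp / (tp + fp)   # 0.9
recall    <- tp / (tp + fn)   # 0.5
f1        <- 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient
mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

round(c(F1 = f1, MCC = mcc), 3)   # F1 ~ 0.643, MCC = 0
```

The F1 of roughly 0.64 looks respectable even though the classifier has no skill at all, while the MCC of 0 correctly reports chance-level performance.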