Solved – Is it reasonable for a classifier to obtain a high AUC and a low MCC, or the opposite?

auc, correlation, roc

Let's say I have 2 models:

1) High Matthews correlation coefficient (MCC) score, low area under the curve (AUC)

2) Low MCC, high AUC

When I say high and low, I mean relative to the other model. I'm not quite sure which model is "better" and how to interpret this difference between the 2 models. Also, to clarify, both models return probability estimates. MCC is determined at a threshold of 0.5.

Best Answer

A binary classifier may produce at prediction time either the class labels directly for each classified instance, or probability values for each class. In the latter case, for each instance it produces a probability $p$ for one class and a probability $q = 1 - p$ for the other class.

If the classifier produces probabilities, one has to use a threshold value in order to obtain classification labels. Usually this threshold is $0.5$, but it does not have to be, and $0.5$ is often not the best possible value.

Now, MCC is computed directly from classification labels, which means that a single threshold value was used to transform the probabilities into labels, whatever that threshold was. AUC, on the other hand, uses the whole range of threshold values.
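To make the distinction concrete, here is a minimal sketch (hypothetical labels and probabilities, using scikit-learn's `matthews_corrcoef` and `roc_auc_score`): MCC needs hard labels obtained from one fixed threshold, while AUC is computed from the probabilities themselves.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score

# Hypothetical ground truth and predicted probabilities P(class = 1)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
p = np.array([0.10, 0.40, 0.60, 0.55, 0.80, 0.90, 0.52, 0.30])

# MCC is computed on hard labels, so one threshold must be chosen (0.5 here)
y_pred = (p >= 0.5).astype(int)
print("MCC at threshold 0.5:", matthews_corrcoef(y_true, y_pred))

# AUC is computed from the probabilities, over all possible thresholds
print("AUC:", roc_auc_score(y_true, p))
```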

The idea is that these two values, AUC and MCC, measure different things. MCC measures a kind of statistical accuracy (it is related to the chi-squared test, which gives some hints about the significance of the differences), while AUC is more related to the robustness of the classifier. The AUC and the ROC curve give more hints on how well the classifier separates the two classes, over all possible threshold values. Even in the degenerate case where AUC is computed directly on labels (not advisable, since it loses a lot of information), the purpose of AUC remains the same.
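For reference, the connection to the chi-squared test is direct: on a $2 \times 2$ confusion table with $n$ classified instances, MCC coincides with the phi coefficient, so

$$|\mathrm{MCC}| = \sqrt{\frac{\chi^2}{n}},$$

where $\chi^2$ is the chi-squared statistic computed from that table.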

Model selection is a hard problem. My advice would be to first try to answer the question: what makes one classifier better than another? Find an answer that includes considerations like the cost matrix, robustness to imbalanced data, whether you want an optimistic or a conservative classifier, and so on. In any case, work out one or a few criteria that can be used to select a metric, rather than measuring several different accuracy-related things first and only later asking what to do with them.

[Later edit - regarding the use of the word "robust"]

I used the term "robust" because I could not find a single word that better describes "how well a classifier separates the two classes". I know that the term "robust" has a special meaning in statistics.

Generally, an AUC close to $1.0$ means the classifier separates the two classes well for many values of the threshold. In this sense, an AUC close to $1.0$ means the classifier is less sensitive to which threshold value is used, i.e. it is robust to this choice. However, an AUC that is not close to $1.0$ does not imply the contrary: there may still be a range of threshold values that work well. In most cases a graphical inspection of the ROC curve is necessary, and this is one of the main reasons why AUC alone is often considered misleading. AUC is a measure over all possible classifiers (one for each possible threshold value), not a measure of a specific classifier, while in practice one cannot use more than one threshold value. So while AUC can give hints about class separation (my use of the term "robustness"), it should not be used alone as a single authoritative measure of accuracy.
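As a rough illustration of this threshold sensitivity (same hypothetical data as in the earlier sketch, again with scikit-learn): MCC changes as the threshold moves, while AUC, and the ROC curve used for graphical inspection, do not depend on any single threshold.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score, roc_curve

# Same hypothetical data as in the earlier snippet
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
p = np.array([0.10, 0.40, 0.60, 0.55, 0.80, 0.90, 0.52, 0.30])

# MCC depends on the threshold used to turn probabilities into labels
for t in (0.3, 0.5, 0.7):
    mcc = matthews_corrcoef(y_true, (p >= t).astype(int))
    print(f"threshold {t:.1f}: MCC = {mcc:.2f}")

# AUC is a single number over all thresholds; the ROC curve shows the whole picture
print("AUC:", roc_auc_score(y_true, p))
fpr, tpr, _ = roc_curve(y_true, p)   # plot fpr vs. tpr to inspect separation
```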
