I have a multi-class confusion matrix as below and would like to draw its associated ROC curve for one of its classes (e.g. class 1).
I know the "one-vs-all-others" approach should be used in this case, but I want to know how exactly we need to change the threshold to obtain different pairs of true-positive and false-positive rates (TPR/FPR).
ROC Curve – How to Draw for a Multi-Class Dataset
classification, machine-learning, multi-class, roc
Related Solutions
As I see it, the possibility to refuse classification as "too uncertain" is the whole point of choosing a threshold (as opposed to assigning the class with highest predicted probability).
Of course, you should have some justification for putting the threshold to 0.5: you may also put it up to 0.9 or any other value that is reasonable.
You describe a setup with mutually exclusive classes (closed-world problem). "No class reaches the threshold" can happen as soon as that threshold is higher than $1/n_\text{classes}$, i.e. the same problem occurs in a 2-class problem with a threshold of, say, 0.9. For a threshold of exactly $1/n_\text{classes}$ it could happen in theory, but in practice it is highly unlikely.
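A minimal sketch of this point, with a made-up softmax output for one observation in a 3-class problem (all numbers are illustrative):

```python
import numpy as np

# Hypothetical softmax output for a single observation (3 classes); sums to 1.
probs = np.array([0.40, 0.35, 0.25])

threshold = 0.5          # higher than 1/3 = 1/n_classes
print(np.any(probs >= threshold))  # -> False: no class reaches the threshold
```

With a threshold of exactly $1/3$ this could only fail in the degenerate case where all three probabilities tie at $1/3$.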
So your problem is not specific to the 3-class set-up; it is just more pronounced there.
To your second question: you can compute ROC curves for any kind of continuous output score; the scores don't even need to claim to be probabilities. Personally, I don't calibrate, because I don't want to waste another test set on that (I work with very restricted sample sizes). The shape of the ROC won't change anyway.
Answer to your comment: The ROC conceptually belongs to a set-up that in my field is called single-class classification: does a patient have a particular disease or not. From that point of view, you can assign a 10% probability that the patient does have the disease. But this does not imply that with 90% probability he has some other defined condition: the complementary 90% actually belong to a "dummy" class, namely not having that disease. For some diseases and tests, finding everyone may be so important that you set your working point at a threshold of 0.1. A textbook example of choosing an extreme working point is HIV testing of blood donations.
So for constructing the ROC for class A (you'd say: the patient is A positive), you look at the class-A posterior probabilities only. For binary classification with $P(\text{not } A) = 1 - P(A)$, you don't need to plot the second ROC, as it does not contain any information that is not readily accessible from the first one.
In your 3-class set-up you can plot a ROC for each class. Depending on how you choose your thresholds, this can result in no class, exactly one class, or more than one class being assigned. What is sensible depends on your problem. E.g. if the classes are "Hepatitis", "HIV", and "broken arm", then this policy is appropriate, as a patient may have none or all of these.
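To make the per-class thresholding concrete, here is a sketch with assumed posterior probabilities for three patients over the three conditions above (the rows need not sum to 1, since the conditions are not mutually exclusive):

```python
import numpy as np

# Assumed posteriors per patient for ("Hepatitis", "HIV", "broken arm").
probs = np.array([
    [0.05, 0.02, 0.90],   # one condition assigned
    [0.70, 0.60, 0.10],   # two conditions assigned
    [0.10, 0.08, 0.20],   # none assigned
])

thresholds = np.array([0.5, 0.5, 0.5])  # one working point per class
assigned = probs >= thresholds          # boolean label matrix

print(assigned.sum(axis=1))  # -> [1 2 0]: one, two, or zero classes per patient
```

Each column of `assigned`, swept over its own threshold, yields one per-class ROC curve.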
You seem to have a few misunderstandings about ROC curves.
I am using ROC curves for multi-label classification.
ROC curves are tools to assess the discrimination ability of binary classifiers. Some extensions exist for other types of problems, such as multi-class or multi-label classification, but strictly speaking they are not ROC curves.
an ROC curve is parameterized by a discrimination threshold
A ROC curve is parameterized over all possible discrimination thresholds between $-\infty$ and $+\infty$.
With a discrimination threshold of 0.9, we assign that observation correctly and no observation incorrectly.
With a threshold of 0.9, we indeed (correctly) assign observation 1 to the positive predicted class.
Because observations 2-5 score below 0.9, they are all assigned to the negative predicted class. As a result, observations 2 and 4, which should be positive, are misclassified as negatives, which decreases the True Positive Rate (sensitivity) and the AUC.
Because the ROC curve is designed for binary classification problems, there is no such thing as "unassigned": if an observation is not positive, it is negative. If this assumption is not appropriate for your problem, then you do not have a binary classification problem, and ROC curves may be the wrong tool for you.
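The 5-observation example can be worked through in a few lines. The individual scores below 0.9 are assumed (only their relation to the threshold matters); observations 1, 2 and 4 are taken as truly positive, as in the question:

```python
import numpy as np

# Illustrative scores: only observation 1 exceeds the 0.9 threshold.
scores = np.array([0.95, 0.70, 0.30, 0.60, 0.10])
truth  = np.array([1,    1,    0,    1,    0])   # observations 1, 2, 4 are positive

pred = scores >= 0.9   # no "unassigned": not positive means negative

tp = int(np.sum(pred & (truth == 1)))   # 1 (observation 1)
fp = int(np.sum(pred & (truth == 0)))   # 0
fn = int(np.sum(~pred & (truth == 1)))  # 2 (observations 2 and 4, missed)
tn = int(np.sum(~pred & (truth == 0)))  # 2

tpr = tp / (tp + fn)   # sensitivity = 1/3, pulled down by the missed positives
fpr = fp / (fp + tn)   # 0.0
print(tpr, fpr)
```

This (FPR, TPR) pair is one point of the ROC curve; sweeping the threshold produces the rest.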
The True Positive Rate is 1 and the False Positive Rate is 0, which is the ideal point at the top left in an ROC curve. We never see that point in an ROC curve
This is wrong: this point is reached as soon as you have a perfect classifier. That might be hard to achieve in your field or for your problem, but it definitely exists.
How exactly does an ROC curve use the discrimination threshold?
I refer you to this CV question: What does AUC stand for and what is it?, which should answer this part of your question.
Best Answer
I assume you use something like softmax to get probability estimates for each class. Say we want to compute the ROC curve for class $c$. For each sample, the softmax gives you $P(y=c|x)$, and one minus this is the probability of "all others"; in this one-vs-rest view, the class-$c$ samples are the positives and everything else is negative. Then, by sweeping the threshold over the range $[0,1]$, you obtain a TP and FP count (and hence a TPR/FPR pair) for each threshold, which you can directly plot.
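A minimal sketch of this sweep, using synthetic softmax outputs (the data, seed, and class index here are made up for illustration):

```python
import numpy as np

# Synthetic 3-class data: logits nudged towards the true class, then softmax.
rng = np.random.default_rng(0)
n, c = 200, 1                                   # n samples; build ROC for class c = 1
y = rng.integers(0, 3, size=n)                  # true labels
logits = rng.normal(size=(n, 3)) + 2.0 * np.eye(3)[y]
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

pos = y == c              # class c is "positive", all other classes "negative"
score = probs[:, c]       # P(y = c | x) from the softmax

tprs, fprs = [], []
for t in np.linspace(0.0, 1.0, 101):            # sweep the threshold over [0, 1]
    pred = score >= t
    tprs.append(np.sum(pred & pos) / np.sum(pos))    # TPR at this threshold
    fprs.append(np.sum(pred & ~pos) / np.sum(~pos))  # FPR at this threshold
# (fprs, tprs) now trace the one-vs-rest ROC curve for class c;
# plot them with matplotlib if desired.
```

In practice, `sklearn.metrics.roc_curve(pos, score)` performs this sweep for you (using the observed scores themselves as thresholds), but the manual loop shows exactly how the threshold generates the TPR/FPR pairs.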