Many binary classification algorithms compute some kind of classification score (sometimes, but not always, a probability of being in the target state) and classify based on whether the score is above a certain threshold. Viewing the ROC curve lets you see the tradeoff between sensitivity and specificity for all possible thresholds, rather than just the one chosen by the modeling technique. Different classification objectives might make one point on the curve more suitable for one task and another point more suitable for a different task, so looking at the ROC curve is a way to assess the model independently of the choice of threshold.
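The threshold sweep described above can be sketched in a few lines of base R; the data here are simulated, purely for illustration:

```r
# Sketch (simulated data): sweep all thresholds of a score-based
# classifier and record the sensitivity/specificity tradeoff.
set.seed(42)
labels <- c(rep(1, 50), rep(0, 50))                     # true classes
scores <- c(rnorm(50, mean = 1), rnorm(50, mean = -1))  # classifier scores

thresholds <- sort(unique(scores))
sens <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
spec <- sapply(thresholds, function(t) mean(scores[labels == 0] <  t))

# Each (1 - specificity, sensitivity) pair is one point on the ROC
# curve; raising the threshold trades sensitivity for specificity.
head(data.frame(threshold = thresholds, sensitivity = sens, specificity = spec))
```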
As I see it, the possibility of refusing to classify a case as "too uncertain" is the whole point of choosing a threshold (as opposed to simply assigning the class with the highest predicted probability).
Of course, you should have some justification for setting the threshold at 0.5: you could equally set it at 0.9 or any other value that is reasonable.
You describe a setup with mutually exclusive classes (a closed-world problem). "No class reaches the threshold" can always happen as soon as that threshold is higher than $1/n_{classes}$; i.e. the same problem occurs in a 2-class problem with a threshold of, say, 0.9. For a threshold of exactly $1/n_{classes}$ it could happen in theory, but in practice it is highly unlikely.
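A tiny sketch with made-up posterior probabilities shows the point:

```r
# Sketch: with mutually exclusive classes, "no class reaches the
# threshold" can happen whenever the threshold exceeds 1/n_classes.
posterior2 <- c(A = 0.6, notA = 0.4)          # 2-class example
posterior3 <- c(A = 0.4, B = 0.35, C = 0.25)  # 3-class example

any(posterior2 >= 0.9)   # FALSE: no class assigned at threshold 0.9
any(posterior3 >= 0.5)   # FALSE: same problem in the 3-class case
any(posterior3 >= 1/3)   # TRUE: at threshold 1/n_classes some class
                         #       must reach it (probabilities sum to 1)
```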
So your problem is not specific to the 3-class set-up, just more pronounced there.
To your second question: you can compute ROC curves for any kind of continuous output score; the scores don't even need to claim to be probabilities. Personally, I don't calibrate, because I don't want to spend another test set on that (I work with very restricted sample sizes). In any case, calibration won't change the shape of the ROC.
Answer to your comment:
The ROC conceptually belongs to a set-up that in my field is called single-class classification: does a patient have a particular disease or not? From that point of view, you can assign a 10% probability that the patient does have the disease. But this does not imply that with 90% probability he has something else that is well defined: the complementary 90% actually belongs to a "dummy" class, not having that disease. For some diseases and tests, finding everyone may be so important that you set your working point at a threshold of 0.1. A textbook example of choosing such an extreme working point is HIV testing of blood donations.
So for constructing the ROC for class A (you'd say: the patient is A positive), you look at class A posterior probabilities only. For binary classification with probability(not A) = 1 - probability(A), you don't need to plot the second ROC, as it does not contain any information that is not readily accessible from the first one.
In your 3-class set-up you can plot a ROC for each class. Depending on how you choose your thresholds, no class, exactly one class, or more than one class may be assigned. What is sensible depends on your problem. E.g. if the classes are "Hepatitis", "HIV", and "broken arm", then this policy is appropriate, as a patient may have none or all of these.
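With hypothetical posterior probabilities, this per-class thresholding policy might look like:

```r
# Sketch: one threshold/ROC per class in a 3-class setting.
# Hypothetical posterior probabilities; each row sums to 1.
set.seed(1)
p <- matrix(runif(150 * 3), ncol = 3)
p <- p / rowSums(p)
colnames(p) <- c("Hepatitis", "HIV", "broken arm")

# Thresholding each class independently: a case may end up with
# no class, exactly one class, or several classes assigned.
threshold <- 0.4
assigned <- p >= threshold
table(rowSums(assigned))  # number of classes assigned per case
```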
Best Answer
For an overall explanation of how ROC curves are computed consider this excellent answer: https://stats.stackexchange.com/a/105577/112731
To your question: first, if you want to compare different approaches, comparing their ROC curves and area under the curve (AUC) values directly is a good idea, as these give you overall information about how powerful your approaches are on your problem.
Second: you will need to choose a threshold appropriate for your goal. The tradeoff here is that you will need to decrease one of TPR (true positive rate, or sensitivity) or TNR (true negative rate, or specificity) in order to increase the other; there is no way around this$^1$. So, depending on your problem, you might e.g. need a low false positive rate (FPR = 1 - TNR), which in turn requires a high TNR, so this will definitely depend on the details of your problem.
Having said this, to choose a threshold you will usually look at both the ROC curve and the distribution of TPR and TNR over the threshold. Those should provide the information required to choose a reasonable tradeoff. As you want to do this in R, here's a minimal example of what this could look like:
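A minimal sketch in base R, with simulated data (packages like pROC provide the same via `roc()` and `coords()`):

```r
# Sketch (simulated data): plot TPR and TNR over the threshold to
# pick a working point.
set.seed(123)
y <- rbinom(200, 1, 0.5)                        # true classes
prob <- plogis(2 * y - 1 + rnorm(200))          # predicted probabilities

thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t) mean(prob[y == 1] >= t))
tnr <- sapply(thresholds, function(t) mean(prob[y == 0] <  t))

plot(thresholds, tpr, type = "l", xlab = "threshold", ylab = "rate")
lines(thresholds, tnr, lty = 2)
legend("right", legend = c("TPR", "TNR"), lty = 1:2)

# threshold where TPR and TNR are about equal
thresholds[which.min(abs(tpr - tnr))]
```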
So in this example, for about equal TPR and TNR, you would want to choose a threshold around 0.5. If you instead wanted, e.g., a very low FPR, you would choose a higher threshold. After choosing a threshold, you can use the predicted class probabilities to immediately determine the predicted class.

$^1$ For completeness: this is how predicted class probabilities from your model are turned into either a "positive" prediction (usually above the threshold) or a "negative" prediction (usually below the threshold).
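Determining the predicted class from the probabilities might look like this (hypothetical values):

```r
# Sketch: turn predicted class probabilities into class labels
# using the chosen threshold.
prob <- c(0.17, 0.94, 0.51, 0.03)  # hypothetical predicted probabilities
threshold <- 0.5
pred <- ifelse(prob >= threshold, "positive", "negative")
pred  # "negative" "positive" "positive" "negative"
```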
Update:

As you just asked how this would be done with e.g. `nnet()`, here's a minimal example. Please note that training on all data will lead to overfitting, so you should instead use techniques like cross validation and resampling (e.g. with the `caret` package, as shown above; there you would just need to set `method='nnet'` to use this model, and could provide hyperparameters in the `tuneGrid` parameter).
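A minimal sketch of such a fit, using the iris data reduced to two classes as stand-in data (remember the overfitting caveat above):

```r
# Sketch with nnet() (package "nnet" ships with R): fit a small
# neural network and get predicted class probabilities.
library(nnet)
set.seed(42)
d <- droplevels(iris[iris$Species != "setosa", ])  # two-class stand-in data

fit <- nnet(Species ~ ., data = d, size = 3, trace = FALSE)

# for a 2-level factor, predict(type = "raw") returns the predicted
# probability of the second factor level
prob <- predict(fit, d, type = "raw")

threshold <- 0.5
pred <- ifelse(prob >= threshold, levels(d$Species)[2], levels(d$Species)[1])
head(pred)
```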