Solved – Cut-off probability for multi-class problem

Tags: classification, multi-class, probability, threshold

I would like to know whether there is a cut-off probability of outcome when classifying observations into more than 2 classes.

For instance, the threshold in binary logistic regression is usually taken to be 0.5, so that a predicted probability below 0.5 counts against the reference outcome.

However, in a three-class problem it seems there are many possibilities. I would assume a cut-off probability of 0.333 here, but what if outcomes A, B and C have a probability of, say, 0.333, 0.666 and 0 respectively? Is there a workaround to get at a cut-off probability in multinomial cases?

The purpose is to use a threshold when comparing the model's predictions with new data, in order to compute deviations from the model.

By way of illustration, consider the binary case, in which the deviation is set to 0 when the actual response (A) matches the response predicted by the model (also A). In case of non-conformity, we subtract the threshold of 0.5 from the probability of the predicted outcome to obtain the deviation score. Any idea how this could be done with 3+ classes?
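In code, the binary rule I have in mind looks roughly like this (a minimal sketch; the function and variable names are just placeholders of mine):

```python
def binary_deviation(predicted_class, actual_class, p_predicted, threshold=0.5):
    """Deviation score for the two-class case described above."""
    if predicted_class == actual_class:
        return 0.0                      # model and observation agree
    return p_predicted - threshold      # how far beyond the cut-off the wrong call was

# Example: the model predicts "A" with probability 0.8, but "B" is observed.
print(binary_deviation("A", "B", 0.8))  # 0.3
```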

Thank you very much in advance!

Best Answer

There is no default probability cutoff for classifiers. Using a 0.5 cutoff is optimal only if you aim at maximizing accuracy (equivalently, minimizing the 0-1 loss), and accuracy is a "problematic" and potentially misleading measure of performance. There are multiple ways (see also this paper) of determining the cutoffs, and they depend on what you consider the "optimal" choice. Imagine that you are choosing between advertising product A or product B. For every successful purchase you would earn \$1 for product A and \$10 for product B. In that case it may be much wiser to bet on B more often than on A, depending of course on the misclassification rates for both categories and other factors. Moreover, it is also debatable whether you should use such cutoffs at all. So there is no simple answer to this question.
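To make the advertising example concrete, here is a small sketch with made-up purchase probabilities (the numbers are purely illustrative, not from the question):

```python
# Purely illustrative numbers: predicted purchase probabilities for one user.
p_buy_A = 0.30   # probability of a purchase if we advertise product A
p_buy_B = 0.05   # probability of a purchase if we advertise product B

profit_A = 1.0   # earnings per purchase of A
profit_B = 10.0  # earnings per purchase of B

expected_A = p_buy_A * profit_A   # 0.30
expected_B = p_buy_B * profit_B   # 0.50

# A converts six times more often, yet advertising B has the higher
# expected profit, so the decision rule should lean towards B.
print("B" if expected_B > expected_A else "A")  # B
```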

All this said, if you aim at maximizing accuracy (i.e. minimizing the 0-1 loss), the simplest decision criterion is to choose the class with the highest predicted probability

$$ c^* = \operatorname{arg\,max}_i \; \hat p(y=c_i|\mathbf{x}) $$
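As a minimal sketch, using the probabilities from the question (no particular modeling library assumed), the arg max rule is just:

```python
import numpy as np

classes = np.array(["A", "B", "C"])
p_hat = np.array([0.333, 0.666, 0.0])   # predicted probabilities for one observation

c_star = classes[np.argmax(p_hat)]      # pick the most probable class
print(c_star)                           # B
```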

However, under Bayesian decision theory, if different choices carry different costs $\ell(y_j, c_i)$, i.e. the cost of classifying an observation as $c_i$ when its true class is $y_j$, then you can define the risk of deciding on class $c_i$ given $\mathbf{x}$ as

$$ R(c_i|\mathbf{x}) = \sum_j \; \ell(y_j, c_i) \;\hat p(y=y_j|\mathbf{x}) $$

and the optimal decision is

$$ c^* = \operatorname{arg\,min}_i \; R(c_i|\mathbf{x}) $$
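A minimal numerical sketch of this rule, using the probabilities from the question and a made-up cost matrix (the costs are purely illustrative):

```python
import numpy as np

classes = np.array(["A", "B", "C"])
p_hat = np.array([0.333, 0.666, 0.0])   # \hat p(y = y_j | x) for one observation

# loss[j, i] = cost of predicting class i when the true class is j.
# Missing a true "A" is made five times as costly as any other mistake.
loss = np.array([
    [0.0, 5.0, 5.0],   # true class A
    [1.0, 0.0, 1.0],   # true class B
    [1.0, 1.0, 0.0],   # true class C
])

# R(c_i | x) = sum_j loss[j, i] * p_hat[j]
risk = p_hat @ loss
c_star = classes[np.argmin(risk)]

print(risk)    # [0.666, 1.665, 2.331]
print(c_star)  # "A" -- even though "B" has the highest probability
```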

The choices made by the two rules coincide under the 0-1 loss, i.e.

$$ \ell(y_j, c_i) = \begin{cases} 0 & \mathrm{if} \; y_j = c_i \\ 1 & \mathrm{if} \; y_j \ne c_i \end{cases} $$
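To see why, plug the 0-1 loss into the risk above:

$$ R(c_i|\mathbf{x}) = \sum_{j:\, y_j \ne c_i} \hat p(y=y_j|\mathbf{x}) = 1 - \hat p(y=c_i|\mathbf{x}), $$

so minimizing the risk is exactly the same as maximizing $\hat p(y=c_i|\mathbf{x})$.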

The decision-theoretic rule above is just a more formal statement of the earlier example with advertisements for products A and B. Of course, the choice of cutoff does not make much difference if you are building a hot dog vs. not-a-hot-dog classifier, but if you are aiming to block violent or explicit content, it is probably better to block it when the classifier is unsure than to leave potentially disturbing content visible. The same applies if you are competing in a Kaggle contest, where using a non-"default" cutoff may make a difference.