The most correct way to handle this is to model the probability of overshooting the threshold separately. (Note: treating the overshoots as NA would put them in the same position as missing data, which is also very common with biomarkers, but that needs entirely different handling. Either way, this kind of 'missing data' is, if I may coin the term, 'missing completely not at random'.) This is not an easy undertaking. A colleague of mine is working on this and has already shown that the correct analysis can yield different results.
Apart from that: if you are not aiming high in statistical correctness, I fear that the 'accepted standard' in many fields for this kind of situation is indeed one of your first two options. I may not like it, but there are worse 'accepted practices' around. Check the literature of your field of interest to see what others do, or choose to do the hard work of fitting more elaborate models.
Short answer: Torgo describes the usual method of generating such curves.
You can choose your threshold (= cut-off limit in the cited text) at any value. The cited text refers to one such choice as a working point.
That is, for a given working point you'll observe exactly one (precision, recall) pair, i.e. one point in your graph. The precision-recall curve is obtained by varying the threshold over the whole range of the classifier's continuous output ("scores", posterior probabilities, "votes"), thus generating a curve from many working points.
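A minimal sketch of this threshold sweep (the scores, labels, and the function name `pr_curve` are made-up illustration material, not from any particular library):

```python
# Sketch: generate the curve by sweeping the threshold across the whole
# range of the classifier's scores. Toy data; any scores/labels work.

def pr_curve(scores, labels):
    """Return (recall, precision) working points; labels: 1 = positive class.

    A case is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    points = []
    for t in sorted(set(scores), reverse=True):  # one working point per threshold
        preds = [l for s, l in zip(scores, labels) if s >= t]
        tp = sum(preds)
        points.append((tp / n_pos, tp / len(preds)))
    return points

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 1, 0, 1]
print([(r, round(p, 2)) for r, p in pr_curve(scores, labels)])
# → [(0.5, 1.0), (1.0, 1.0), (1.0, 0.67), (1.0, 0.5)]
```

This is the O(n²) "by definition" version; as noted below, sorting the scores once lets you do the same thing in a single pass.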
Edit with respect to the comment:
I think "varying the threshold" is the usual way to explain or define the curve.
For the calculation, it is more efficient to sort the scores and then track how precision and recall change as each successive case is added: precision and recall can only change when the threshold moves far enough to cross the next score.
Consider this example:
case   true class   predicted score (high => class B)
  1        A            0.2
  3        B            0.5
  2        A            0.6
  4        B            0.9
threshold     recall   precision
 > 0.9         0.0       N/A
 (0.6, 0.9]    0.5       1.0
 (0.5, 0.6]    0.5       0.5
 (0.2, 0.5]    1.0       0.67
 <= 0.2        1.0       0.5
That is, the precision-recall curve actually consists of discrete points. It jumps from one point to the next when the threshold crosses an actually predicted score. A smooth curve results only for large numbers of test cases.
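The sorted single-pass calculation can be sketched as follows (the data are the four cases from the table above; the name `pr_points_sorted` is my own invention):

```python
# Sketch of the sorting-based calculation: sort cases by score
# (descending) and update TP/FP incrementally. Each step corresponds
# to the threshold crossing the next predicted score.

def pr_points_sorted(scores, labels):
    """One pass over score-sorted cases; labels: 1 = class B (positive)."""
    n_pos = sum(labels)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))  # (recall, precision)
    return points

scores = [0.2, 0.5, 0.6, 0.9]   # cases 1, 3, 2, 4
labels = [0, 1, 0, 1]           # A = 0, B = 1
print([(r, round(p, 2)) for r, p in pr_points_sorted(scores, labels)])
# → [(0.5, 1.0), (0.5, 0.5), (1.0, 0.67), (1.0, 0.5)]
```

The output reproduces the four defined working points of the table (the `> 0.9` row, where precision is undefined, has no corresponding step).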
Best Answer
I suspect that the answer is "no", i.e., that there is no such way.
Here is an illustration, where we plot the predicted probabilities against the true labels:
Since the denominator $P+N$ in the formula for accuracy does not change, what you are trying to do is to shift the horizontal red line up or down (the height being the threshold you are interested in) in order to maximize the number of "positive" dots above the line plus the number of "negative" dots below the line. Where this optimal line lies depends entirely on the shape of the two point clouds, i.e., the conditional distribution of the predicted probabilities per true label.
Your best bet is likely a bisection search.
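As an alternative to bisection: accuracy is a step function of the threshold, changing only where the threshold crosses an observed predicted probability, so an exhaustive scan over the midpoints between sorted unique scores finds the exact maximizer. A sketch of that idea (the data and the name `best_threshold` are invented for illustration):

```python
# Sketch: accuracy changes only when the threshold crosses an observed
# predicted probability, so scanning the midpoints between sorted unique
# scores finds the exact maximizer. Toy data below.

def best_threshold(probs, labels):
    """Return (threshold, accuracy) maximizing accuracy; labels in {0, 1}."""
    n = len(labels)
    cand = sorted(set(probs))
    # Candidate cut points: below the lowest score, between neighbours,
    # and above the highest score.
    cuts = [cand[0] - 1e-9]
    cuts += [(a + b) / 2 for a, b in zip(cand, cand[1:])]
    cuts += [cand[-1] + 1e-9]
    return max(
        ((t, sum((p > t) == bool(l) for p, l in zip(probs, labels)) / n)
         for t in cuts),
        key=lambda pair: pair[1],
    )

probs = [0.1, 0.3, 0.35, 0.6, 0.8]
labels = [0, 0, 1, 0, 1]
t, acc = best_threshold(probs, labels)
print(round(t, 3), acc)
# → 0.325 0.8
```

For n predictions this is O(n²) in the naive form above (or O(n log n) with an incremental count, as in the precision-recall case), which is cheap for typical test-set sizes.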
That said, I recommend you look at