Binary Classification – Is There a Binary Classification Metric that Optimizes True Positive Predictions Given an Upper Bound on False Discovery Rate?

Tags: binary-data, classification, machine-learning, metric

I'm looking for a metric that behaves like this

[Figure: a sketch of the desired metric, which rewards a larger N but falls off a cliff once the false discovery rate exceeds 5%]

Here, N can be interpreted as the total number of positive predictions. (Alternatively, you can interpret N as the total number of true positives. Either should be fine.)

I want the score to fall off a cliff if the false discovery rate goes above 5%. Otherwise, the metric should strive for the highest number of positive classifications.
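For concreteness, here is a sketch of one way the score could behave; `cliff_score` is just a name I made up, and the hard zero past the cutoff is the simplest way I can think of to express the "cliff":

# Reward the raw count of positive predictions, but collapse to zero
# once the false discovery rate FP / (FP + TP) exceeds 5%
cliff_score <- function(y_true, y_pred, max_fdr = 0.05) {
  tp <- sum(y_pred == 1 & y_true == 1)
  fp <- sum(y_pred == 1 & y_true == 0)
  if (tp + fp == 0) return(0)          # no positive predictions at all
  fdr <- fp / (tp + fp)
  if (fdr > max_fdr) 0 else (tp + fp)  # N = total number of positive predictions
}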

Does such a classification metric exist?

Motivational Example

Suppose I'm building a computer vision model to identify shoplifters. I want the ability to promise that the model will flag as many shoplifters as possible without exceeding a 5% false discovery rate.

Best Answer

You could consider incorporating the classification threshold in the optimization of TPR (along with the usual parameter estimates) and applying a constraint that FPR be less than $5\%$. This would be something like a Karush-Kuhn-Tucker constrained optimization (maximize TPR, constrained to FPR no greater than a particular rate like $5\%$), though the function being optimized will have jumps rather than being smooth. This gets rid of the richness of probabilistic predictions but might be your best bet if you have a hard cutoff of $5\%$ FPR. Then your metric would be TPR, given FPR no greater than the maximum allowed. (If you allow for parameter tuning at the same time as threshold tuning, my suspicion is that you can hit exactly $5\%$ FPR, at least in theory.)
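As a rough sketch of what that looks like at the threshold-tuning stage, assuming you already have held-out labels `y` and predicted probabilities `p_hat` from some fitted model (both names are placeholders, as is `best_threshold`):

# Maximize TPR subject to FPR <= max_fpr by searching over candidate thresholds
best_threshold <- function(y, p_hat, max_fpr = 0.05) {
  thresholds <- sort(unique(p_hat))
  results <- do.call(rbind, lapply(thresholds, function(t) {
    pred <- as.integer(p_hat >= t)
    data.frame(
      threshold = t,
      tpr = sum(pred == 1 & y == 1) / sum(y == 1),
      fpr = sum(pred == 1 & y == 0) / sum(y == 0)
    )
  }))
  feasible <- results[results$fpr <= max_fpr, ]
  feasible[which.max(feasible$tpr), ]   # highest TPR within the constraint
}

In practice you would run this on held-out data, and the jumps mentioned above are exactly why this is a grid search over candidate thresholds rather than a smooth optimization.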

This is evocative of hypothesis testing, where we consider the power of a test (analogous to TPR) at a given tolerance for type I errors (false positives).

ORIGINAL

Your comments indicate that $5\%$ FDR is not such a hard requirement. Consequently, it is not quite correct to say that you want to optimize TPR without exceeding $5\%$ FDR.

It sounds like you want accurate predictions of positive/negative probability, which is exactly what the standard Crossentropy loss measures: lower loss means better predictions of the positive/negative probability. Crossentropy loss is one example of a so-called “strictly proper scoring rule”.
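For concreteness, binary crossentropy can be computed directly from the observed labels and the predicted probabilities; the sketch below uses placeholder names `y` and `p_hat`, and clips the probabilities only to avoid taking log(0):

# Binary crossentropy (log loss): lower is better, and it is minimized
# in expectation by the true probabilities (a strictly proper scoring rule)
log_loss <- function(y, p_hat, eps = 1e-15) {
  p_hat <- pmin(pmax(p_hat, eps), 1 - eps)
  -mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}
log_loss(c(1, 0, 1), c(0.9, 0.2, 0.6))   # roughly 0.28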

Once you have accurate probabilities, you can fiddle with thresholds to get your desired tradeoff of FPR and TPR, which might turn out to mean that you tolerate $6\%$ or $8\%$ FPR because of how much of an improvement in TPR you get by accepting a slightly higher FPR. Likewise, you might find yourself only accepting $3\%$ FPR due to how little improvement you get in TPR upon tolerating slightly higher FPR; you might have to go up to $20\%$ FPR to get much improvement in TPR.

Note that there can be multiple actions despite there being only two observed categories. That is one place where predicting probabilities rather than discrete classes gives a lot of power. It’s good to be able to say, “I don’t know, and making either kind of wrong classification is pretty costly, so I’d like to gather more information.” We prefer when models can be confident in their predictions, sure, but that isn’t always realistic.
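For instance, here is a sketch of a three-action rule driven by a single predicted probability; the cutoffs 0.9 and 0.4 and the name `decide` are arbitrary placeholders you would set from the costs of each kind of mistake:

# One predicted probability, three possible actions
decide <- function(p_hat) {
  ifelse(p_hat >= 0.9, "flag",
  ifelse(p_hat >= 0.4, "gather more information",
                       "ignore"))
}
decide(c(0.97, 0.55, 0.08))   # returns "flag", "gather more information", "ignore"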

EDIT

Models tend to output values in a continuum, frequently bounded between $0$ and $1$ and interpreted as probabilities. A common, maybe even ubiquitous, software default is to use $0.5$ as a threshold for binning the continuous values into discrete categories. For a software default, if there must be one, I suppose it makes sense to classify the observation according to the class to which it most probably belongs. However, there is a cost associated with wrong decisions, and you might be willing to sacrifice classifications of a not-so-important category in order to catch the category of primary interest. As an example from medicine, it is inconvenient to have to go into a Covid quarantine, but society seemed to deem it reasonable to put non-Covid cases in quarantine to assure that very few true Covid cases got missed and wandered about the world to infect others.

(You are free to disagree with how Covid quarantines and lockdowns occurred, but many societies did take this stance. That just means you would prefer a different cutoff than society seemed to adopt. Likewise, if you think society didn’t do enough and should have been in a near-$100\%$ lockdown, it just means you would have preferred a different threshold and much lower evidence of Covid in order to warrant isolation.)

Thus, we might want to change the threshold away from $0.5$ in order to suit our values about the decisions. Maybe, instead of requiring high evidence of a category being likely, we want to flag as soon as the probability is above $0.1$. Then $0.1$ would be our threshold instead of $0.5$, and we would get the TPR and FPR rates that go along with such a threshold.

By fiddling with the threshold, you can tune your FPR and TPR.

Let’s look at a simulation where I do some tuning of the threshold.

library(MLmetrics)
set.seed(2022)
N <- 100
p <- sort(runif(N, 0, 1))           # true event probabilities
y_true <- rbinom(N, 1, p)           # observed 0/1 outcomes
p_pred <- sort(rbeta(N, 1/3, 1/3))  # simulated "model" probabilities

# Let's use the default threshold of 0.5
# (specifying positive = "1" makes explicit which class is the "positive" one)
y_pred <- as.integer(p_pred >= 0.5)
tpr <- MLmetrics::Sensitivity(y_true, y_pred, positive = "1")
fpr <- 1 - MLmetrics::Specificity(y_true, y_pred, positive = "1")
print(paste(
  "Threshold:", 0.5,
  "; TPR:", tpr,
  "; FPR:", fpr
))

# Let's use a threshold of 0.1: flag an observation as positive
# whenever its predicted probability is at least 0.1
# (comparing directly to the threshold is clearer than shifting
# the probabilities and rounding)
#
y_pred <- as.integer(p_pred >= 0.1)
tpr <- MLmetrics::Sensitivity(y_true, y_pred, positive = "1")
fpr <- 1 - MLmetrics::Specificity(y_true, y_pred, positive = "1")
print(paste(
  "Threshold:", 0.1,
  "; TPR:", tpr,
  "; FPR:", fpr
))

# Let's use a threshold of 0.9
#
y_pred <- as.integer(p_pred >= 0.9)
tpr <- MLmetrics::Sensitivity(y_true, y_pred, positive = "1")
fpr <- 1 - MLmetrics::Specificity(y_true, y_pred, positive = "1")
print(paste(
  "Threshold:", 0.9,
  "; TPR:", tpr,
  "; FPR:", fpr
))

# Let's use a threshold of 0.99
#
y_pred <- as.integer(p_pred >= 0.99)
tpr <- MLmetrics::Sensitivity(y_true, y_pred, positive = "1")
fpr <- 1 - MLmetrics::Specificity(y_true, y_pred, positive = "1")
print(paste(
  "Threshold:", 0.99,
  "; TPR:", tpr,
  "; FPR:", fpr
))

# Let's use a threshold of 0.95
#
y_pred <- as.integer(p_pred >= 0.95)
tpr <- MLmetrics::Sensitivity(y_true, y_pred, positive = "1")
fpr <- 1 - MLmetrics::Specificity(y_true, y_pred, positive = "1")
print(paste(
  "Threshold:", 0.95,
  "; TPR:", tpr,
  "; FPR:", fpr
))

I get that your ideal threshold is between $0.95$ and $0.99$. You can fiddle with this to your heart's content to get the threshold that works best for you, noting the way the threshold impacts both FPR and TPR. If, in your analysis of this, you find yourself getting a huge increase in TPR by accepting an FPR of $0.07$ instead of $0.05$, it is your prerogative to accept that.
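Rather than trying thresholds one at a time, you can also sweep a grid of thresholds over the same simulated `y_true` and `p_pred` and read off the largest TPR whose FPR stays within your limit. This is only a sketch; the grid spacing and the $5\%$ cap are arbitrary choices.

thresholds <- seq(0.05, 0.95, by = 0.05)
sweep_df <- data.frame(
  threshold = thresholds,
  tpr = sapply(thresholds, function(t)
    MLmetrics::Sensitivity(y_true, as.integer(p_pred >= t), positive = "1")),
  fpr = sapply(thresholds, function(t)
    1 - MLmetrics::Specificity(y_true, as.integer(p_pred >= t), positive = "1"))
)
feasible <- subset(sweep_df, fpr <= 0.05)
feasible[which.max(feasible$tpr), ]   # best threshold under a 5% FPR cap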

This is related to, but not quite the same as, looking at ROC curves. Key differences are:

  1. ROC curves have nothing to do with calibration, and you can divide all of your probabilities by two and wind up with the same ROC curve (demonstrated in the short sketch after this list).

  2. ROC curves can involve TPR and FPR values that you would never accept.
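The first point is easy to check with the simulated data above: halving every predicted probability wrecks calibration but, being a monotone transformation, leaves the ROC curve and its AUC unchanged. (A sketch; as far as I recall, `MLmetrics::AUC` takes the predictions first and the labels second.)

MLmetrics::AUC(p_pred, y_true)       # AUC with the original probabilities
MLmetrics::AUC(p_pred / 2, y_true)   # same AUC after halving every probability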

Frank Harrell has written negatively of ROC curves. He has an amusing tweet on the subject and multiple mentions of his dislike of ROC curves on his blog and in posts on Cross Validated. Among his complaints is that TPR and FPR are not aligned with forward-looking decision-making, so you might consider looking at forward-looking metrics like positive predictive value and negative predictive value instead. However, if you are determined to work with TPR and FPR, then that's what you have to work with.

If you do choose to go with positive and negative predictive values, you also can tune the threshold as I have done in order to get your desired tradeoff. After all, you have to create the categories out of the probability outputs somehow, and there is no reason why the software default must be the appropriate threshold for your task.
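Here is a sketch that computes PPV and NPV from raw counts at one of the thresholds used above; note that the false discovery rate from the original question is just 1 - PPV, so a cap on FDR is equivalently a floor on PPV. The function name `predictive_values` is my own.

predictive_values <- function(y, p_hat, threshold) {
  pred <- as.integer(p_hat >= threshold)
  tp <- sum(pred == 1 & y == 1); fp <- sum(pred == 1 & y == 0)
  tn <- sum(pred == 0 & y == 0); fn <- sum(pred == 0 & y == 1)
  c(ppv = tp / (tp + fp),   # positive predictive value
    npv = tn / (tn + fn),   # negative predictive value
    fdr = fp / (tp + fp))   # false discovery rate = 1 - PPV
}
predictive_values(y_true, p_pred, threshold = 0.95)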

Note that tuning the threshold and forcing every observation to be in a discrete category gets rid of your ability to have a grey zone of “I don’t know, collect more data,” which might be the best decision in such a situation.