Solved – Determine thresholds for test from ROC-curve

data-visualization, python, roc, scikit-learn

I'm trying to determine the threshold for my original variable from an ROC curve. I have generated the curve using the variable and the outcome, and I have obtained threshold data from sklearn's ROC function. However, I am confused as to how the threshold relates back to the values of the variable when identifying the cut-off.

I've produced a minimum working example:

import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

x = np.random.randint(40, 400, 100).reshape(-1, 1)
y = np.random.randint(0, 2, 100)

model = LogisticRegression()
model.fit(x, y)
probs = model.predict_proba(x)
fpr, tpr, thresholds = metrics.roc_curve(y, probs[:,1])

plt.plot(fpr, tpr)
plt.plot(np.linspace(0, 1, 10), np.linspace(0, 1, 10))

threshold_of_interest = thresholds[np.argmax(tpr - fpr)]

So basically, how do I relate 'threshold_of_interest' back to 'x'?
Thanks!

Best Answer

Thanks for supplying an (almost) working example. I fixed some typos and added a plot that might help you understand the output.

import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

x = np.random.randint(40, 400, 100).reshape(-1, 1)
y = np.random.randint(0, 2, 100)

model = LogisticRegression()
model.fit(x, y)
probs = model.predict_proba(x)
fpr, tpr, thresholds = metrics.roc_curve(y, probs[:,1])

# %%
plt.subplots(figsize=(10, 6))
plt.plot(fpr, tpr, 'o-', label="ROC curve")
plt.plot(np.linspace(0,1,10), np.linspace(0,1,10), label="diagonal")
# annotate every 5th ROC point with its probability threshold
for xi, yi, thr in zip(fpr[::5], tpr[::5], thresholds[::5]):
    plt.annotate(np.round(thr, 2), (xi, yi - 0.04))
rnd_idx = 27  # an arbitrary point to highlight
plt.annotate('this point refers to the tpr and the fpr\n at a probability threshold of {}'.format(np.round(thresholds[rnd_idx], 2)), 
             xy=(fpr[rnd_idx], tpr[rnd_idx]), xytext=(fpr[rnd_idx]+0.2, tpr[rnd_idx]-0.25),
             arrowprops=dict(facecolor='black', lw=2, arrowstyle='->'),)
plt.legend(loc="upper left")
plt.xlabel("FPR")
plt.ylabel("TPR")

[Image: ROC curve with every 5th probability threshold annotated]

Remember that the ROC curve is based on a confidence threshold. Here you provided the probabilities from the LR classifier. Normally you would use 0.5 as the decision boundary, but you can choose whatever boundary you want, and the ROC curve is there to help you: sometimes TPR is more important to you than FPR. When you only plot the TPR and the FPR against each other you lose the threshold information, but you can easily add it back to the plot. I only annotated every 5th value, which should be enough to see the relationship (high confidence in the bottom left, low confidence in the top right).

Since your question was actually "how to relate the threshold of interest back to x", the answer is: you cannot. x was your input matrix on which you performed the prediction; the thresholds are only related to the predictions from the LR classifier (probs in your code).
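If you do pick a boundary from the curve, for example the one maximising TPR - FPR as in your threshold_of_interest, applying it is just a comparison against the predicted probabilities. A minimal sketch, reusing probs, tpr, fpr and thresholds from above (nothing here is sklearn-specific):

threshold_of_interest = thresholds[np.argmax(tpr - fpr)]            # the choice from your question
y_pred_custom = (probs[:, 1] >= threshold_of_interest).astype(int)  # classify with the chosen boundary instead of 0.5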

Note that sklearn does not compute the TPR/FPR after each entry: the dimension of your tpr is (60,), but your input had dimension (100,). You can read up on this by studying the drop_intermediate parameter of roc_curve.
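As a small illustrative sketch (not part of your original code), you can compare the default output with drop_intermediate=False, which keeps every threshold instead of the thinned-out set:

fpr_all, tpr_all, thr_all = metrics.roc_curve(y, probs[:, 1], drop_intermediate=False)
print(thresholds.shape, thr_all.shape)  # the second array is at least as long as the first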

HTH