What do the thresholds on the x and y axes of an ROC curve represent?

auc, classification, decision-theory, roc

There is a detailed explanation of what the AUC of an ROC curve is here. However, I have searched high and low for an explanation of what the x- and y-axes of the ROC curve are. I have understood that they are decision thresholds, but in practical terms what does that mean?

Best Answer

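Each point on the curve corresponds to a different decision threshold, but the axes themselves are not thresholds. The y-axis usually shows the true positive rate (the proportion of actual positives that are correctly classified as positive), and the x-axis usually shows the false positive rate (the proportion of actual negatives that are wrongly classified as positive).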

Different values of true and false positive rates can be obtained from the same dataset by applying different thresholds. For example, imagine I have a set of noisy measurements that come from two distributions, one centered at 0 (noise, or "signal-absent" distribution) and one centered at 1 ("signal-present" distribution).

set.seed(1)
library(ggplot2)
d <- data.frame(measurement = c(rnorm(100, mean=1),rnorm(100, mean=0)), 
                signal=c(rep("present",100),rep("absent",100)))
ggplot(d,aes(x=measurement,color=signal,fill=signal))+geom_density(alpha=0.6)+theme_bw()

I want to classify each measurement according to whether or not it contained a signal. The ROC curve is computed by placing the decision threshold at different measurement values and computing, for each value, the proportion of signal-present measurements that are correctly classified as such (i.e., that are at least as large as the threshold); this is plotted on the y-axis. Similarly, for each threshold one also computes the fraction of measurements that do not contain a signal but are misclassified as containing one ("false positives", that is, measurements from the noise, or signal-absent, distribution that are at least as large as the threshold); this is plotted on the x-axis.

# functions to compute true and false positive rates
TPR <- function(d, th){ sum(d$signal=="present" & d$measurement>=th) / sum(d$signal=="present")}
FPR <- function(d, th){ sum(d$signal=="absent" & d$measurement>=th) / sum(d$signal=="absent")}

# use all the sorted values as possible thresholds
thresholds <- sort(d$measurement)

roc <- data.frame(y=sapply(thresholds, function(th){TPR(d,th)}), 
                 x=sapply(thresholds, function(th){FPR(d,th)}) )

ggplot(roc, aes(x, y)) +
  geom_point() +
  theme_bw() +
  labs(y = "fraction of 'signal present'\nmeasurements >= threshold\n(true positive rate)",
       x = "fraction of 'signal absent'\nmeasurements >= threshold\n(false positive rate)") +
  geom_abline(intercept = 0, slope = 1, lty = 2)
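
To see what a single point on the curve means, it can help to fix a few thresholds and look at the two rates directly. The sketch below does this for the underlying distributions rather than the simulated sample, assuming unit-variance normals centered at 0 and 1 (the `rnorm` defaults used above). The names `th`, `tpr_theory` and `fpr_theory` are illustrative and not part of the original answer.

```r
# Population (large-sample) rates for a few thresholds, assuming
# signal-absent ~ N(0, 1) and signal-present ~ N(1, 1).
th <- c(-1, 0, 0.5, 1, 2)

# P(measurement >= th) under each distribution
tpr_theory <- pnorm(th, mean = 1, lower.tail = FALSE)
fpr_theory <- pnorm(th, mean = 0, lower.tail = FALSE)

# each row is one point (x = FPR, y = TPR) on the ROC curve
round(data.frame(threshold = th, FPR = fpr_theory, TPR = tpr_theory), 3)
```

Raising the threshold lowers both rates, so the corresponding point moves toward the lower-left corner of the plot. At a threshold of 0.5, halfway between the two means, the true positive rate is about 0.69 and the false positive rate about 0.31.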

In more practical terms, the y-coordinate of each point indicates the (estimated) probability that a signal-present measurement is correctly classified as such, given a certain value of the decision threshold. The x-coordinate of the same point represents the probability of misclassifying a signal-absent measurement as "signal-present" for the same threshold. If in a particular setting false positives and false negatives have different costs, the ROC curve can be used to find the optimal threshold (that is, the threshold that minimizes the expected cost).
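
As a rough sketch of that last point, one could attach a cost to each kind of error and pick the threshold with the lowest expected cost per measurement. The cost values `cost_fn` and `cost_fp` below are made-up assumptions for illustration, and the data and rate functions are simply recreated from the example above so that the block runs on its own.

```r
# Sketch: choose the threshold that minimizes the expected misclassification cost.
set.seed(1)
d <- data.frame(measurement = c(rnorm(100, mean = 1), rnorm(100, mean = 0)),
                signal = c(rep("present", 100), rep("absent", 100)))

TPR <- function(d, th) sum(d$signal == "present" & d$measurement >= th) / sum(d$signal == "present")
FPR <- function(d, th) sum(d$signal == "absent" & d$measurement >= th) / sum(d$signal == "absent")

cost_fn <- 5   # hypothetical cost of missing a signal (false negative)
cost_fp <- 1   # hypothetical cost of a false alarm (false positive)
p_present <- mean(d$signal == "present")   # proportion of signal-present cases in the sample

thresholds <- sort(d$measurement)

# expected cost = P(present) * miss rate * cost_fn + P(absent) * false alarm rate * cost_fp
expected_cost <- sapply(thresholds, function(th) {
  p_present * (1 - TPR(d, th)) * cost_fn + (1 - p_present) * FPR(d, th) * cost_fp
})

best <- which.min(expected_cost)
data.frame(threshold = thresholds[best],
           FPR = FPR(d, thresholds[best]),
           TPR = TPR(d, thresholds[best]),
           expected_cost = expected_cost[best])
```

Because a miss is assumed here to cost five times as much as a false alarm, one would expect the selected threshold to sit below the midpoint between the two distributions, trading extra false positives for fewer misses. With different costs or a different proportion of signal-present cases, the selected point on the ROC curve would shift accordingly.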
