Solved – Understanding ROC curve


I'm having trouble understanding the ROC curve.

Is there any advantage or improvement in the area under the ROC curve if I build a separate model on each disjoint subset of the training set and combine their predicted probabilities?
For example, if $y$ has values of $\{a, a, a, a, b, b, b, b\}$, I could build model $A$ from one subset of the rows (say the 1st, 4th, 5th and 8th), build model $B$ from the remaining training data, and then combine the predicted probabilities from the two models. Any thoughts / comments will be much appreciated.

Here is R code to better explain my question:

library(rpart)

Y    = factor(c(0,0,0,0,1,1,1,1))
X    = matrix(rnorm(16,8,2), nrow=8)   # 8 observations, 2 predictors
ind  = c(1,4,5,8)                      # rows for model A (contains both classes)
ind2 = -ind                            # remaining rows, used for model B

mod_A    = rpart(Y[ind]~X[ind,])
mod_B    = rpart(Y[ind2]~X[ind2,])
mod_full = rpart(Y~X)

pred_combine = numeric(8)
pred_combine[ind]  = predict(mod_A,type='prob')[,2]    # prob of class 1
pred_combine[ind2] = predict(mod_B,type='prob')[,2]
pred_full          = predict(mod_full,type='prob')[,2]

So my question is: how does the area under the ROC curve of pred_combine compare with that of pred_full?
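
For instance, the two AUCs could be compared like this (a sketch assuming the pROC package; any AUC implementation would do):

library(pROC)

auc(roc(Y, pred_combine))
auc(roc(Y, pred_full))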

Best Answer

I'm not sure I fully understood the question, but since the title asks for an explanation of ROC curves, I'll try.

ROC Curves are used to see how well your classifier can separate positive and negative examples and to identify the best threshold for separating them.

To be able to use the ROC curve, your classifier has to be a ranking classifier - that is, it should be able to rank examples such that the ones with higher rank are more likely to be positive. For example, Logistic Regression outputs probabilities, which is a score you can use for ranking.
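
As a quick sketch (the data frame df and its columns outcome and x1 are hypothetical names, not anything from the question):

fit    = glm(outcome ~ x1, data = df, family = binomial)   # logistic regression
scores = predict(fit, type = 'response')                   # probabilities in [0, 1]

Ranking the examples by scores is all the ROC construction below needs; the absolute calibration of the probabilities doesn't matter.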

Drawing the ROC Curve

Given a data set and a ranking classifier:

  • order the test examples by the score from the highest to the lowest
  • start in $(0, 0)$
  • for each example $x$ in the sorted order
    • if $x$ is positive, move $1/\text{pos}$ up
    • if $x$ is negative, move $1/\text{neg}$ right

where $\text{pos}$ and $\text{neg}$ are the numbers of positive and negative examples respectively (so the curve always ends at $(1, 1)$).
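
Here is a minimal sketch of this procedure in base R (roc_points is a hypothetical helper of mine, not a library function):

roc_points = function(scores, labels) {
  ord = order(scores, decreasing = TRUE)  # highest score first
  lab = labels[ord] == 1                  # TRUE for positives
  pos = sum(lab)                          # number of positive examples
  neg = sum(!lab)                         # number of negative examples
  # each positive moves the curve 1/pos up, each negative 1/neg right
  data.frame(fpr = c(0, cumsum(!lab) / neg),
             tpr = c(0, cumsum(lab) / pos))
}

# e.g. plot(roc_points(pred_full, Y), type = 's')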

This animated picture should illustrate the process more clearly:

[animated figure: building the ROC curve]

On this graph, the $y$-axis is the true positive rate and the $x$-axis is the false positive rate. Note the diagonal line - this is the baseline that a random classifier achieves. The further our ROC curve is above this line, the better.

Area Under ROC

[figure: area under the ROC curve]

The area under the ROC curve (shaded) naturally shows how far the curve is from the baseline. For the baseline it's 0.5, and for a perfect classifier it's 1.
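
Equivalently, the AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, which gives a simple way to compute it (a sketch; auc_rank is a hypothetical helper, and ties are counted as 1/2):

auc_rank = function(scores, labels) {
  pos = scores[labels == 1]
  neg = scores[labels == 0]
  # fraction of (positive, negative) pairs that are ranked correctly
  (sum(outer(pos, neg, '>')) + 0.5 * sum(outer(pos, neg, '=='))) /
    (length(pos) * length(neg))
}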

You can read more about AUC ROC in this question: What does AUC stand for and what is it?

Selecting the Best Threshold

I'll briefly outline the process of selecting the best threshold; more details can be found in the reference.

To select the best threshold you see each point of your ROC curve as a separate classifier. Each of these mini-classifiers uses the score of that point as the boundary between + and - (i.e. it classifies as + all examples whose score is at least that of the current point).

Depending on the pos/neg fraction in your data set, you build iso-accuracy lines (parallel to the baseline when the classes are split 50%/50%) and take the point that lies on the line with the best accuracy.
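
In code the same idea reduces to trying each ROC point's score as a cut-off and keeping the most accurate one (a sketch for the plain-accuracy criterion; best_threshold is a hypothetical helper):

best_threshold = function(scores, labels) {
  cand = sort(unique(scores), decreasing = TRUE)   # one candidate per ROC point
  acc  = sapply(cand, function(t) mean((scores >= t) == (labels == 1)))
  cand[which.max(acc)]
}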

Here's a picture that illustrates that; for details I again invite you to the reference.

[figure: selecting the best threshold]

Reference
