Machine Learning – How to Determine the Optimal Threshold for a Classifier and Generate ROC Curve

machine learning, roc, svm

Let's say we have an SVM classifier. How do we generate an ROC curve, in theory, given that we compute a TPR and an FPR for each threshold? And how do we determine the optimal threshold for this SVM classifier?

Best Answer

Use the SVM classifier to classify a set of annotated examples; one prediction run over the examples yields one point in ROC space. Suppose there are 200 examples. First, count how many examples fall into each of the four cases.

\begin{array} {|r|r|r|} \hline & \text{labeled true} & \text{labeled false} \\ \hline \text{predicted true} &71& 28\\ \hline \text{predicted false} &57&44 \\ \hline \end{array}


Then compute the TPR (True Positive Rate) and the FPR (False Positive Rate): $TPR = 71/(71+57) = 0.5547$ and $FPR = 28/(28+44) = 0.3889$. In ROC space, the x-axis is the FPR and the y-axis is the TPR, so this run gives the point $(0.3889, 0.5547)$.
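The arithmetic above can be checked with a few lines of Python. This is only a sketch: the counts are the ones from the table, while the function name `rates` is made up for illustration.

```python
# Counts taken from the confusion table above.
tp, fp = 71, 28   # predicted true:  labeled true / labeled false
fn, tn = 57, 44   # predicted false: labeled true / labeled false

def rates(tp, fp, fn, tn):
    """Return (FPR, TPR) for one confusion table (hypothetical helper)."""
    tpr = tp / (tp + fn)   # share of labeled-true examples predicted true
    fpr = fp / (fp + tn)   # share of labeled-false examples predicted true
    return fpr, tpr

fpr, tpr = rates(tp, fp, fn, tn)
print(f"FPR = {fpr:.4f}, TPR = {tpr:.4f}")  # FPR = 0.3889, TPR = 0.5547
```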

To draw an ROC curve:

  1. Adjust some threshold value that controls the number of examples labelled true or false
    For example, if a concentration of a certain protein above α% signifies a disease, different values of α yield different final TPR and FPR values. The threshold values can be chosen in a way similar to grid search: label the training examples with different threshold values, train classifiers on the resulting sets of labelled examples, run each classifier on the test data, compute the FPR values, and select the threshold values whose FPRs cover the range from low (close to 0) to high (close to 1), that is, FPRs close to 0, 0.05, 0.1, ..., 0.95, 1
  2. Generate many sets of annotated examples
  3. Run the classifier on the sets of examples
  4. Compute a (FPR, TPR) point for each of them
  5. Draw the final ROC curve
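When the classifier outputs a real-valued score per example, a common simplification of the steps above is to keep one set of annotated examples and sweep the threshold over the scores, without retraining. The sketch below assumes that setting. For an SVM the score would typically be the signed distance to the separating hyperplane, which is an assumption here and not something the answer states. The scores, the labels and the name `roc_points` are all hypothetical.

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores and return (thresholds, points).

    An example is predicted true when its score >= threshold.
    Each point is an (FPR, TPR) pair. Hypothetical helper.
    """
    pos = sum(labels)             # number of labeled-true examples
    neg = len(labels) - pos       # number of labeled-false examples
    # Start above every score (nothing predicted true), then lower the
    # threshold through each distinct score value.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return thresholds, points

# Hypothetical classifier scores and true labels (1 = true, 0 = false).
scores = [1.8, 1.1, 0.6, 0.2, -0.1, -0.4, -0.9, -1.5]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

thresholds, points = roc_points(scores, labels)
for t, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold {t:>5}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

The resulting points run from (0, 0) to (1, 1); connecting them in order with any plotting tool gives the ROC curve.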

More details are available at http://en.wikipedia.org/wiki/Receiver_operating_characteristic.

In addition, the two links below are useful for determining an optimal threshold. A simple method is to take the threshold with the maximal sum of true positive and true negative rates (sensitivity plus specificity). Finer criteria may weigh in other quantities that depend on the threshold, such as financial costs.
http://www.medicalbiostatistics.com/roccurve.pdf
http://www.kovcomp.co.uk/support/XL-Tut/life-ROC-curves-receiver-operating-characteristic.html
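One common reading of the "maximal sum" criterion is to maximize TPR + TNR, where TNR = 1 - FPR (this is Youden's J statistic plus one). A minimal sketch of that selection follows, using hypothetical (threshold, FPR, TPR) triples. The data and the name `best_threshold` are illustrative assumptions, and a cost-based criterion would replace the scoring expression.

```python
# Hypothetical (threshold, FPR, TPR) triples, one per point on an ROC curve.
roc = [
    (1.8, 0.00, 0.25),
    (1.1, 0.00, 0.50),
    (0.6, 0.25, 0.50),
    (0.2, 0.25, 0.75),
    (-0.1, 0.25, 1.00),
    (-0.4, 0.50, 1.00),
    (-0.9, 0.75, 1.00),
    (-1.5, 1.00, 1.00),
]

def best_threshold(roc):
    """Return the triple that maximizes TPR + TNR (hypothetical helper)."""
    # row = (threshold, fpr, tpr); the true negative rate is 1 - fpr.
    return max(roc, key=lambda row: row[2] + (1.0 - row[1]))

threshold, fpr, tpr = best_threshold(roc)
print(f"threshold = {threshold}, FPR = {fpr}, TPR = {tpr}")
```

With these made-up numbers the selected threshold is -0.1, where TPR + TNR = 1.75.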
