Solved – ROC curve and parameter selection

hyperparameter, roc, svm

I am using the ROC-curve concept to select the hyperparameters of a one-class SVM classifier, as follows:

I have a dataset that includes a normal class and an abnormal class. I train the one-class SVM on the normal-class data only and then try to predict the abnormal data. Since I already know which cases are normal and which are abnormal, once the classifier makes its predictions I calculate the true positive rate (TPR) and false positive rate (FPR). I build a grid over the parameters gamma (10^-9 to 10^-2) and nu (0.001 to 0.01) and measure the TPR and FPR as described above for each hyperparameter combination.
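A minimal sketch of this grid search, assuming scikit-learn's OneClassSVM; the data arrays (X_normal, X_test, y_test), the synthetic data, the grid sizes, and the convention of treating "abnormal" as the positive class are my assumptions for illustration, not details from the question.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Assumed data layout (hypothetical names):
#   X_normal : training samples from the normal class only
#   X_test   : held-out samples containing both normal and abnormal cases
#   y_test   : ground-truth labels for X_test, +1 = normal, -1 = abnormal
rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(200, 2))
X_test = np.vstack([rng.normal(0, 1, size=(50, 2)),   # normal
                    rng.normal(4, 1, size=(50, 2))])  # abnormal
y_test = np.array([1] * 50 + [-1] * 50)

results = []
for gamma in np.logspace(-9, -2, 8):          # gamma grid: 10^-9 .. 10^-2
    for nu in np.linspace(0.001, 0.01, 10):   # nu grid: 0.001 .. 0.01
        clf = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu)
        clf.fit(X_normal)                      # train on normal data only
        y_pred = clf.predict(X_test)           # +1 = predicted normal, -1 = predicted abnormal

        # Count outcomes, treating "abnormal" as the positive class.
        tp = np.sum((y_pred == -1) & (y_test == -1))
        fn = np.sum((y_pred == 1) & (y_test == -1))
        fp = np.sum((y_pred == -1) & (y_test == 1))
        tn = np.sum((y_pred == 1) & (y_test == 1))
        results.append((gamma, nu, tp / (tp + fn), fp / (fp + tn)))  # (gamma, nu, TPR, FPR)
```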

Since I want to determine the best hyperparameters, I plot these (TPR, FPR) pairs as in the ROC-curve concept (i.e. plotting 1 − FPR on the x-axis against TPR on the y-axis) and select the hyperparameter set whose point is closest to (1, 1) in the graph.
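Continuing the sketch above, that selection rule (the point nearest the ideal corner (1, 1) in (1 − FPR, TPR) coordinates) can be written as:

```python
import numpy as np

# results holds (gamma, nu, tpr, fpr) tuples from the loop above.
def distance_to_corner(r):
    gamma, nu, tpr, fpr = r
    # Euclidean distance of the point (1 - FPR, TPR) from the ideal corner (1, 1).
    return np.hypot(1.0 - (1.0 - fpr), 1.0 - tpr)

best_gamma, best_nu, best_tpr, best_fpr = min(results, key=distance_to_corner)
print(f"best gamma={best_gamma:.1e}, nu={best_nu:.3f}, "
      f"TPR={best_tpr:.2f}, FPR={best_fpr:.2f}")
```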

Is the ROC supposed to be used this way? Can this be called ROC? Are there any pitfalls in using the ROC concept this way to determine hyperparameters?

Note:
The reason I use a one-class SVM instead of an ordinary SVM (which would require the target variable y) is that in a real-world deployment of this model I would not be able to obtain y, and I cannot deploy an already-trained binary model because the test cases are too varied for any single model to generalize well over all possible types. So I want the model to learn what is "normal" and recognize "abnormal" as anything that deviates from it.

Best Answer

The concordance probability (c-index, AUROC) is just a restatement of the Wilcoxon-Mann-Whitney rank-sum statistic, so it uses only the ranks of the predicted probabilities. As such, it is not a valid primary measure and should only be used descriptively. You can maximize the c-index with a model that is not the best model, and optimizing c will not calibrate the model's predictions. The gold-standard objective function, which uses full information and will lead to selecting an actually best model, is the log likelihood. If you have too many parameters for the sample size to support, use a penalized form of the log likelihood, e.g. ridge, lasso, elastic net, or Bayesian skeptical priors.
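As a rough illustration of tuning by the (penalized) log likelihood rather than by AUROC, here is a sketch using an ordinary probabilistic classifier, not the one-class SVM from the question; the data, the ridge-penalized logistic regression, and the grid of penalty strengths are all made up for the example. Cross-validated log loss is the negative average log likelihood per observation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical labeled data (y in {0, 1}); the point is the scoring choice,
# not the particular model.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Penalized log likelihood: logistic regression with a ridge (L2) penalty,
# tuning the penalty strength by cross-validated log loss rather than AUROC.
grid = GridSearchCV(
    LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000),
    param_grid={"C": np.logspace(-3, 3, 7)},
    scoring="neg_log_loss",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)  # best C and its mean log loss
```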

Note that whenever you use an ROC curve to choose a threshold on a predictor or on a predicted probability, you are turning the analysis into a decision problem without using an appropriate utility/cost/loss function. See here for details.
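For contrast, a small sketch of the decision-theoretic alternative: with calibrated probabilities and explicit misclassification costs (the cost numbers below are hypothetical), the decision threshold follows from the costs, with no ROC curve involved.

```python
# Expected loss of acting:      cost_fp * (1 - p)
# Expected loss of not acting:  cost_fn * p
# Act when cost_fn * p >= cost_fp * (1 - p), i.e. p >= cost_fp / (cost_fp + cost_fn).
cost_fp = 1.0    # assumed cost of acting when the event is absent
cost_fn = 10.0   # assumed cost of not acting when the event is present
threshold = cost_fp / (cost_fp + cost_fn)
print(threshold)  # 0.0909... : act whenever the predicted probability exceeds this
```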