Solved – Fastest way to compare ROC curves

auc, classification, cross-validation, model-selection, roc

I have a set of true positive (TP) values which are used to train a model.

I am using 5-fold cross validation to train my model (i.e. I split my true positives into 5 folds, use 4/5ths for training and 1/5th for testing).

I repeat this using a different 1/5th as the test set each time.
For each run, I have a large set of mixed true positives / true negatives which I classify with the trained model, and from this I obtain an ROC curve.
This is done for each run of the cross validation (i.e. I end up with 5 ROC curves).

I then average the five AUCs and return the mean.
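For concreteness, here is a minimal sketch of this per-fold procedure in Python (X, y and the LogisticRegression classifier are placeholders standing in for my actual data and model):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.linear_model import LogisticRegression  # stand-in for my real model
    from sklearn.metrics import roc_auc_score

    X, y = np.random.rand(500, 10), np.random.randint(0, 2, 500)  # placeholder data

    fold_aucs = []
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(X, y):
        model = LogisticRegression().fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        fold_aucs.append(roc_auc_score(y[test_idx], scores))  # one ROC/AUC per fold

    print("mean AUC over 5 folds:", np.mean(fold_aucs))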

My problem:

I have two methods of classification: call them method A and method B.
For each method, I get 5 ROC curves.
How can I determine which method gives me a better ROC if I have more than one ROC for each?

I know that computing an AUC per fold, averaging per method, and then comparing the averaged AUCs is NOT a good approach.

Note: I actually have more than one model (roughly 120); I described a single model above for simplicity. So I have 120 models, each one classifying the data using both method A and method B, and for each method there are 5 ROC curves from cross validation.

Edit


My problem, more specifically: I have >100 sets of sequences, and for each set I construct a position weight matrix (PWM), which I then use to score against all sets merged together. I have several scoring schemes, so I'd like to determine which ones give me the best classification. For this I use cross validation: I split the data in each set into 5 folds, train my PWM on 4/5ths of the data and test it on the remaining 1/5th, pool the results from the 5 runs, and plot the ROC curve / compute its AUC.
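As a rough illustration of the scoring idea (the sequences and the build_pwm / pwm_score helpers below are simplified stand-ins, not my actual data or scoring schemes):

    import numpy as np

    BASES = "ACGT"

    def build_pwm(training_seqs, pseudocount=1.0):
        """Log-odds position weight matrix against a uniform background."""
        counts = np.full((len(training_seqs[0]), 4), pseudocount)
        for seq in training_seqs:
            for i, base in enumerate(seq):
                counts[i, BASES.index(base)] += 1
        freqs = counts / counts.sum(axis=1, keepdims=True)
        return np.log2(freqs / 0.25)

    def pwm_score(pwm, seq):
        """Score a candidate sequence by summing per-position log-odds."""
        return sum(pwm[i, BASES.index(base)] for i, base in enumerate(seq))

    pwm = build_pwm(["ACGTACGT", "ACGTTCGT", "AGGTACGA"])  # toy training fold
    print(pwm_score(pwm, "ACGTACGT"))  # higher score = more motif-like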

Best Answer

There is more to k-fold CV than what you are doing. In essence, the point of using those structured splits instead of simply drawing a few random subsamples is that you can reconstruct the full set of predictions and compare it with the original labels, just as you would with predictions made on the full training set.

So, sticking to the full k-fold CV mechanism, you just have to merge the predictions from all folds and calculate the ROC on that pooled set -- this way you get a single AUROC per model.
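For example, with scikit-learn you can collect out-of-fold predictions and score them in one go (X, y and the two classifiers below are just placeholders for your data and for methods A and B):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_predict
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    X, y = np.random.rand(500, 10), np.random.randint(0, 2, 500)  # placeholder data
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    for name, clf in [("method A", LogisticRegression()),
                      ("method B", RandomForestClassifier())]:
        # Each prediction comes from the fold in which that sample was held out,
        # so pooling them into one ROC is legitimate.
        pooled = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
        print(name, "pooled AUROC:", roc_auc_score(y, pooled))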

However, note that just having two numbers and picking the greater one is not a statistically valid way of making comparisons -- without the spread of each you cannot reject the hypothesis that both accuracies are roughly the same. So if you are sure you want to do model selection, you'll need to estimate those spreads (for instance by bootstrapping the k-fold CV to obtain several AUROC values per classifier) and run some multiple-comparison test, probably a non-parametric one.
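One possible sketch of that last step, using repeated stratified k-fold CV as a stand-in for the resampling to get a spread of AUROC values, plus a paired Wilcoxon signed-rank test to compare two methods (the data and classifiers are placeholders; with your ~120 models you would additionally need a multiple-comparison correction such as Holm's, or a Friedman test with a post-hoc procedure):

    import numpy as np
    from scipy.stats import wilcoxon
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    X, y = np.random.rand(500, 10), np.random.randint(0, 2, 500)  # placeholder data
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

    # One AUROC per (fold, repeat) split for each method.
    auc_a = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="roc_auc")
    auc_b = cross_val_score(RandomForestClassifier(), X, y, cv=cv, scoring="roc_auc")

    stat, p = wilcoxon(auc_a, auc_b)  # paired, non-parametric
    print(f"median AUROC A={np.median(auc_a):.3f}, B={np.median(auc_b):.3f}, p={p:.3g}")

Keep in mind that the per-split AUROC values are not independent (the splits share data), so treat the resulting p-value as a rough guide rather than an exact significance level.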