Estimating ROC/AUC on large data sets

Tags: auc, large-data, roc

Plotting the ROC curve of a classifier against the case labels requires that the data set first be sorted by classifier score. I need to calculate ROC on a large data set very quickly, and the sort is the bottleneck (even using quicksort in C or F90). If, instead of calculating ROC by thresholding at every case in the data set, I threshold at every 100th case, my execution time decreases by orders of magnitude depending on how I write the code. The result is an ROC curve with, say, 10,000 points instead of 1,000,000. My tests show that the areas under the two curves agree to more than 5 decimal places.
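For concreteness, here is a minimal NumPy sketch of the thinning idea, not my actual code: it builds the full ROC curve from sorted scores, then keeps only every 100th threshold (plus the endpoint) and compares the two trapezoidal areas. The synthetic labels/scores and the stride of 100 are purely illustrative; the sketch demonstrates the closeness of the areas, not the speed-up itself.

```python
import numpy as np

def roc_points(scores, labels):
    """Full ROC curve: one threshold per case, sorted by descending score."""
    order = np.argsort(-scores)        # this sort is the run-time bottleneck
    labels = labels[order]
    tps = np.cumsum(labels)            # true positives at each threshold
    fps = np.cumsum(1 - labels)        # false positives at each threshold
    return fps / fps[-1], tps / tps[-1]   # (fpr, tpr)

def auc(fpr, tpr):
    """Trapezoidal area under the curve, with the (0, 0) origin prepended."""
    fpr = np.concatenate(([0.0], fpr))
    tpr = np.concatenate(([0.0], tpr))
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

rng = np.random.default_rng(0)
n = 1_000_000
labels = rng.integers(0, 2, n).astype(float)
scores = labels + rng.normal(0.0, 1.5, n)   # synthetic scores correlated with labels

fpr, tpr = roc_points(scores, labels)

# Thin to every 100th threshold, keeping the (1, 1) endpoint.
keep = np.unique(np.r_[0:n:100, n - 1])
print(auc(fpr, tpr))              # full curve, 1,000,000 points
print(auc(fpr[keep], tpr[keep]))  # thinned curve, ~10,000 points
```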

I would like to use this method but have not run into anyone else trying to speed up the calculation this way. Most of the literature covers applications of ROC analysis where the data sets are relatively small and execution time is not an issue, so I have not found anyone else using this or any other method to speed up the calculation by "thinning" out the points on the curve.

Has anyone come across a reference or study that has used or evaluated this or another method for speeding up ROC analysis? If so, or if you have other thoughts, please share.

Best Answer

It makes sense to take a large random sample of N cases (say 10,000) to estimate the real distribution (the full 1 million). The area under the curve will be an approximation, but an increasingly good one as N grows.
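As a rough illustration of this, here is a Python sketch using scikit-learn's roc_auc_score. The synthetic data and the stratified_sample helper are assumptions for the example; stratifying by label is just one way to keep the sample's class balance close to the full data set's (see the warning below).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1_000_000
labels = rng.integers(0, 2, n)
scores = labels + rng.normal(0.0, 1.5, n)   # synthetic scores correlated with labels

def stratified_sample(labels, size, rng):
    """Sample positives and negatives separately so the subsample keeps
    the full data set's class balance."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = round(size * len(pos) / len(labels))
    return np.concatenate([rng.choice(pos, size=n_pos, replace=False),
                           rng.choice(neg, size=size - n_pos, replace=False)])

idx = stratified_sample(labels, 10_000, rng)
print(roc_auc_score(labels, scores))            # AUC on all 1,000,000 cases
print(roc_auc_score(labels[idx], scores[idx]))  # AUC on the 10,000-case sample
```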

If this is something that needs to be done frequently, you can run the ROC calculation with increasingly large sample sizes to find an optimally large subset, where "optimal" means the loss of information is acceptable; a sketch of that search follows below. Be warned that a random sample still needs to be representative of the full data set (whatever that means for your study).
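One way to implement that search, as a sketch: double the sample size until two successive AUC estimates agree within a tolerance. The auc_by_subsampling name and the tol and start parameters are illustrative choices, not a prescription.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_subsampling(scores, labels, tol=1e-3, start=1_000, seed=0):
    """Double the random-sample size until two successive AUC estimates
    agree within `tol`; returns (estimate, sample size used)."""
    rng = np.random.default_rng(seed)
    n, size, prev = len(scores), start, None
    while True:
        idx = rng.choice(n, size=min(size, n), replace=False)
        est = roc_auc_score(labels[idx], scores[idx])
        if prev is not None and abs(est - prev) < tol:
            return est, min(size, n)
        if size >= n:                 # ran out of data before converging
            return est, n
        prev, size = est, size * 2

# Usage with synthetic stand-in data for the real scores and labels:
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 1_000_000)
scores = labels + rng.normal(0.0, 1.5, 1_000_000)
print(auc_by_subsampling(scores, labels))
```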

I can't cite any literature off-hand, but I know this type of sampling is often used in practice for various reasons. I, for one, often use sampling to reduce a large data set (1-2 million cases) to something more easily handled (~5-10K) before starting a data analysis.