I've been calculating receiver operating characteristic (ROC) curves on very large datasets for my thesis. I tried to run these in the pROC
R package but the compute time was very long with large datasets. I resolved this issue by using the ROCR
R package instead. These packages appear to calculate the AUROC differently:
- On data inputs of increasing size ($n$), the compute time of
pROC
increases $n^2$. - On data inputs of increasing size ($n$), the compute time of
ROCR
increases linearly with respect to $n$.
It has also not escaped my notice that these packages can get different estimates for AUC: Differences in AUC calculation in R between pROC and AUC
What is the difference in how the AUROC is calculated and should I be concerned that ROCR
may be less accurate?
Best Answer
There are two different algorithms used here.
The naive approach is to list all the thresholds of the ROC curve (typically $O(N)$ with continuous variables), and calculate sensitivity and specificity on each of them (again $O(N)$). Because of the two $O(N)$, this has a worst-case of $O(N^2)$. But it can be surprisingly efficient when the predictor takes only a very few values. This is the approach taken by
pROC
by default.The second approach is to realize that if you sort the predictions in increasing order, the true positives and false positives rates are the cummulative sums of the observations belonging to each class. This last step can be calculated in $O(N)$ and it is easy to get the specificity and sensitivity from that. This is what
ROCR
does.Of course you need to sort the predictions first. By default, R uses shell sort which has a complexity of $O(N log N)$. In practice the sort operation is pretty much instantaneous at the N we're talking about (I guess we're below 1e9) and can be neglected, leaving an apparent $O(n)$.
These two algorithms are equally accurate, as I have just explained in the question you linked (I wasn't aware of this question earlier). I don't think it is necessary to repeat it here, except maybe to note that
ROCR
behaves in the same way asAUC
.Finally, you can note that
pROC
has an implementation of the second algorithm as well, that you can turn on withalgorithm=2
: