Solved – Difference between pROC and ROCR in compute time and accuracy

auccomputational-statisticsrroc

I've been calculating receiver operating characteristic (ROC) curves on very large datasets for my thesis. I tried to run these in the pROC R package but the compute time was very long with large datasets. I resolved this issue by using the ROCR R package instead. These packages appear to calculate the AUROC differently:

On data inputs of increasing size ($n$), the compute time of pROC increases $n^2$.
On data inputs of increasing size ($n$), the compute time of ROCR increases linearly with respect to $n$.

It has also not escaped my notice that these packages can get different estimates for AUC: Differences in AUC calculation in R between pROC and AUC

What is the difference in how the AUROC is calculated and should I be concerned that ROCR may be less accurate?

Best Answer

There are two different algorithms used here.

The naive approach is to list all the thresholds of the ROC curve (typically $O(N)$ with continuous variables), and calculate sensitivity and specificity on each of them (again $O(N)$). Because of the two $O(N)$, this has a worst-case of $O(N^2)$. But it can be surprisingly efficient when the predictor takes only a very few values. This is the approach taken by pROC by default.

The second approach is to realize that if you sort the predictions in increasing order, the true positives and false positives rates are the cummulative sums of the observations belonging to each class. This last step can be calculated in $O(N)$ and it is easy to get the specificity and sensitivity from that. This is what ROCR does.

Of course you need to sort the predictions first. By default, R uses shell sort which has a complexity of $O(N log N)$. In practice the sort operation is pretty much instantaneous at the N we're talking about (I guess we're below 1e9) and can be neglected, leaving an apparent $O(n)$.

These two algorithms are equally accurate, as I have just explained in the question you linked (I wasn't aware of this question earlier). I don't think it is necessary to repeat it here, except maybe to note that ROCR behaves in the same way as AUC.

Finally, you can note that pROC has an implementation of the second algorithm as well, that you can turn on with algorithm=2:

roc(..., algorithm = 2)

Best Answer

Related Solutions

Solved – In R how to compute the p-value for area under ROC

Solved – Possible to optimize for area under the precision-recall curve in glmnet logistic regression

Related Question