Let's say I build two machine learning classifiers, A and B, on the same dataset.
I obtain the ROC curves for both A and B, along with their AUC values.
What statistical test should I use to compare these two classifiers? (Let's say A is the one I developed, and B is a baseline model.)
Thanks!
Best Answer
Personally, I suggest using a randomized permutation test.
The area under the curve (AUC) is just one test statistic. You have probably seen that the AUC of A is higher than that of B, so that much is already established. What is not established is whether this superiority is due to a systematic difference or to sheer dumb luck.
Therefore, the question now is: is the difference (regardless of which is better than the other) big enough to warrant assuming that it is due to systematic differences between methods A and B? In other words: what is the probability of observing a difference at least this large if A and B did not differ systematically?
Generally, if you go with a randomized permutation test, the procedure to estimate the probability above ($p$ value) is:
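As a rough illustration, one common variant (a paired permutation test on the AUC difference) could be sketched in Python like this. It is a minimal sketch under some assumptions of mine, not necessarily the exact procedure meant here: both classifiers are scored on the same test samples, their scores are on comparable scales (e.g. predicted probabilities), and the "shuffle" consists of randomly swapping the A and B scores of each sample. The names `auc`, `permutation_test_auc`, `n_perm` and `seed` are made up for the example.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_test_auc(labels, scores_a, scores_b, n_perm=2000, seed=0):
    """Two-sided paired permutation test for AUC(A) - AUC(B).

    Assumption: under the null hypothesis the two classifiers are
    exchangeable, so swapping their scores on any sample is equally likely.
    """
    rng = random.Random(seed)
    observed = auc(labels, scores_a) - auc(labels, scores_b)
    extreme = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap this sample's two scores
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = auc(labels, perm_a) - auc(labels, perm_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p value is never exactly zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


# Toy data (made up): A ranks every positive above every negative,
# B's scores are close to unrelated to the labels.
labels = [0] * 20 + [1] * 20
scores_a = [i / 40 for i in range(40)]
scores_b = [((i * 7) % 40) / 40 for i in range(40)]
observed, p_value = permutation_test_auc(labels, scores_a, scores_b)
print(f"observed AUC difference = {observed:.3f}, p = {p_value:.4f}")
```

A small $p$ value here means that random relabelings of "which score came from which classifier" rarely produce a difference as large as the observed one. The pairwise `auc` above is quadratic in the sample size, so for large test sets a rank-based AUC routine would probably be the better choice.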