Solved – What statistical tests to compare two AUCs from two models on the same dataset

hypothesis-testing, machine-learning, statistical-significance

Let's say I build two machine learning classifiers, A and B, on the same dataset.

I obtain the ROC curves for both A and B, and their AUC values.

What statistical test should I use to compare these two classifiers? (Let's say A is the model I propose, and B is a baseline model.)

Thanks!

Best Answer

Personally, I suggest using a randomized permutation test.

Area under the curve (AUC) is just one test statistic. You have probably seen that the statistic of A is better than that of B, so it's already established that the AUC of A is higher than the AUC of B. What is not established is whether this superiority is due to a systematic difference or to sheer dumb luck.

Therefore, the question now is: is the difference (regardless of which is better) big enough to warrant assuming that it is due to systematic differences between methods A and B? In other words:

  • What is the probability of observing that A is better than B under the null hypothesis (which states that A and B have no systematic difference)?

Generally, if you go with a randomized permutation test, the procedure to estimate the probability above (the $p$-value) is as follows (a code sketch is given after the list):

  1. Calculate AUC of A vs. B (which I assume you already did).
  2. Create $C_1$, a pair-wise random shuffle of the scores from A and B: for each instance, keep A's score or swap in B's score with probability $\frac{1}{2}$. In other words, $C_1$ is a simulation of what a random, non-systematic difference looks like.
  3. Measure the AUC of $C_1$.
  4. Test whether the AUC of $C_1$ is better than the AUC of A. If yes, increment the counter $damn$.
  5. Repeat steps 2 to 4 $n$ times, but instead of $C_1$, use $C_i$ where $i \in \{2, 3, \ldots, n\}$. Usually $n = 1000$, but since the test is asymptotically consistent, you are free to use a larger $n$ if you have enough CPU time.
  6. Then, $p = \frac{damn}{n}$.
  7. If $p \le \alpha$, the difference is significant. Usually $\alpha = 0.05$. Otherwise: we don't know (maybe we need more data).
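
Here is a minimal sketch of that procedure in Python, assuming your classifiers output per-instance scores and that you have the true labels. The names `y_true`, `scores_a`, and `scores_b` are hypothetical placeholders for your own data; the AUC is computed with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_p_value(y_true, scores_a, scores_b, n=1000, seed=0):
    """Randomized permutation test comparing the AUC of A to pairwise shuffles of A and B."""
    rng = np.random.default_rng(seed)
    auc_a = roc_auc_score(y_true, scores_a)   # step 1: AUC of A
    damn = 0                                  # counter from step 4
    for _ in range(n):                        # steps 2-5, repeated n times
        # step 2: pair-wise shuffle -- for each instance, keep A's score
        # or swap in B's score with probability 1/2
        swap = rng.random(len(y_true)) < 0.5
        c_i = np.where(swap, scores_b, scores_a)
        # steps 3-4: AUC of C_i, compared against the AUC of A
        if roc_auc_score(y_true, c_i) > auc_a:
            damn += 1
    return damn / n                           # step 6

# Usage (with your own arrays):
# p = permutation_p_value(y_true, scores_a, scores_b, n=1000)
# print("significant" if p <= 0.05 else "inconclusive")
```

The pairwise swap keeps the two score lists aligned with the same instances, which is what makes this a paired permutation test rather than an independent-samples one.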