Solved – Can the ROC AUC of a total test set be larger than the ROC AUC of every subset in a partition of that test set?

auc, model-evaluation, roc

In testing an ML classifier I built, I came across some confusing behavior.

My model is trained on several distinct datasets that I've combined into one total dataset. I constructed test and validation sets by holding out some fraction of the examples from each constituent dataset. I had trained a few models on one of the datasets before I found the other, so I was interested in seeing whether a new model trained on the combined data performed better on that original dataset than the models I had trained on just the original data.

The figure of merit for my task is the area under the ROC curve, and on the total dataset the new model beats my old models on this metric. However, I found something very odd: when I partition my test set into subsets defined by which dataset each example came from, the ROC AUC of each partition is often lower than the ROC AUC of the combined test set.

My expectation was that the ROC AUC on the total dataset would fall somewhere between the ROC AUCs of the partitions. Granted, the discrepancy is not large: the combined score is typically about 0.01-0.02 higher than the performance on the best partition.
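Roughly, the bookkeeping for this comparison looks like the following minimal sketch (the arrays, scores, and dataset labels here are made up purely for illustration, with scikit-learn's roc_auc_score standing in for whatever AUC routine is used):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative dummy data: ground-truth labels, model scores, and a marker
# for which original dataset each test example came from.
y_true = np.array([0, 1, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.8, 0.3, 0.4, 0.9, 0.2, 0.7, 0.6])
source = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

# AUC on the combined test set.
print("combined test set AUC:", roc_auc_score(y_true, y_score))

# AUC on each partition, defined by the source dataset.
for name in np.unique(source):
    mask = source == name
    print(f"AUC on partition {name}:", roc_auc_score(y_true[mask], y_score[mask]))
```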

I suppose that if the two datasets generally produce model scores biased towards different mean values, the combination could essentially stretch out my ROC curve and create some kind of artifact, but I don't see an easy way to prove that this can be the case.

Has anyone come across this before? Is there a bug in my code?

As an example in context:

Suppose I make a big dataset out of pictures of cats and dogs, and I try to build a classifier that guesses whether an animal has been to the vet in the last year. I build a test set from some number, A, of cat pics, and some number, B, of dog pics (so that A/B reflects the ratio of the number of cat/dog pics in the entire corpus). Is it possible that a classifier could have a ROC AUC on the combined test set of cat and dog images that is larger than the ROC AUC for just the dog images in the test set and just the cat images in the test set?

Best Answer

Yes, it's possible. An alternative definition of the AUC is the probability that a randomly chosen ground-truth positive sample is ranked (scored) higher than a randomly chosen ground-truth negative sample. For example, suppose class A has the following scores and ground-truth labels:

scores: [0.09, 0.5, 0.7]

labels: [-, +, -]

Then the AUC = 1/2 (the single positive, 0.5, outranks one of the two negatives).

For class B:

scores: [0.095, 0.41, 0.42]

labels: [-, +, -]

Then AUC = 1/2 again (0.41 outranks only 0.095).

Combining the two:

scores: [0.09, 0.095, 0.41, 0.42, 0.5, 0.7]

labels: [-, -, +, -, +, -]

AUC = (1/2)(1/2) + (3/4)(1/2) = 5/8 = 0.625, averaging over the two positives: 0.41 now outranks 2 of the 4 negatives, and 0.5 outranks 3 of the 4.
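To check the arithmetic numerically, here is a minimal sketch that implements the pairwise definition above directly and, as a sanity check, compares it with scikit-learn's roc_auc_score (the arrays are just the toy scores from this example):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive scores higher; ties count as half a win."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

scores_a, labels_a = [0.09, 0.5, 0.7], [0, 1, 0]
scores_b, labels_b = [0.095, 0.41, 0.42], [0, 1, 0]

print(pairwise_auc(scores_a, labels_a))                        # 0.5
print(pairwise_auc(scores_b, labels_b))                        # 0.5
print(pairwise_auc(scores_a + scores_b, labels_a + labels_b))  # 0.625

# scikit-learn agrees on the pooled set:
print(roc_auc_score(labels_a + labels_b, scores_a + scores_b))  # 0.625
```

The pooled AUC exceeds both per-class AUCs because the cross-class pairs (a positive from one class against a negative from the other) are ranked correctly more often (3 out of 4) than the within-class pairs (1 out of 2 in each class).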
