Solved – Wilcoxon signed rank test for comparing classifiers

classification, regression-strategies, scipy, wilcoxon-signed-rank

I am interested in comparing Classifier A with Classifier B. I have obtained micro-averaged F1 scores for both classifiers that I intend to compare pairwise, and I want to find out whether Classifier A is better than B.

I am a little unclear on how to actually conduct the Wilcoxon signed-rank test. As far as I understand, the null hypothesis is that there is no significant difference between the classifiers, and the alternative hypothesis is that there is. Is this correct? If so, how do I actually show that A is better than B? Even if I reject the null hypothesis, all I have shown is that there is a significant difference in classifier performance, not that A is better than B…
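For concreteness, here is a minimal sketch of how I would run a one-sided version of the test with scipy.stats.wilcoxon, assuming paired per-fold micro-averaged F1 scores from cross-validation (the values below are made up):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical micro-averaged F1 scores for each of 10 CV folds
f1_a = np.array([0.82, 0.79, 0.85, 0.81, 0.83, 0.80, 0.84, 0.78, 0.82, 0.81])
f1_b = np.array([0.80, 0.78, 0.83, 0.80, 0.81, 0.79, 0.82, 0.77, 0.80, 0.80])

# alternative='greater' tests H1: the paired differences f1_a - f1_b are
# shifted above zero, i.e. A tends to outperform B, rather than the
# default two-sided "some difference exists in either direction"
stat, p = wilcoxon(f1_a, f1_b, alternative='greater')
print(f"W = {stat:.1f}, one-sided p = {p:.4f}")
```

Is using the one-sided alternative like this the right way to address the directionality issue?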

Best Answer

There are many approaches, most of them not very powerful (e.g., comparing two ROC areas (c-indexes)). Two powerful approaches, most easily applied in an independent validation sample, are given below. Both require that you extract much more than information-losing "classifications" from the "classifiers"; efficient approaches need, e.g., the estimated probabilities of class membership.

  1. Embed the continuous predicted values from both methods in a "super model" and run a likelihood-ratio $\chi^2$ test of whether method A adds predictive information to method B, and vice versa. If one adds to the other but the reverse is not true, that one is clearly better (see the sketch after this list).
  2. Use the rcorrp.cens function in the R Hmisc package to test the null hypothesis that method A is no more concordant with the outcome than method B. This approach is more powerful than testing for a difference in ROC areas, and it works by forming all possible pairs of the paired predictions.
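A minimal sketch of approach 1, assuming a binary outcome y and continuous predicted probabilities pred_a and pred_b from the two methods (the nested logistic fits use statsmodels, and the data here are entirely hypothetical; in practice one might enter the logits of the probabilities rather than the raw probabilities):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Hypothetical validation data: binary outcome and the two methods'
# predicted probabilities of class membership
n = 500
y = rng.integers(0, 2, n)
pred_a = np.clip(0.5 * y + 0.25 + 0.20 * rng.normal(size=n), 0.01, 0.99)
pred_b = np.clip(0.3 * y + 0.35 + 0.25 * rng.normal(size=n), 0.01, 0.99)

def lr_test(base_cols, full_cols):
    """Likelihood-ratio chi-square test of whether the extra columns in
    the full model add predictive information beyond the base model."""
    base = sm.Logit(y, sm.add_constant(np.column_stack(base_cols))).fit(disp=0)
    full = sm.Logit(y, sm.add_constant(np.column_stack(full_cols))).fit(disp=0)
    lr = 2 * (full.llf - base.llf)          # LR statistic
    df = full.df_model - base.df_model      # extra parameters
    return lr, chi2.sf(lr, df)

# Does A add to B, and does B add to A?
lr_ab, p_ab = lr_test([pred_b], [pred_b, pred_a])
lr_ba, p_ba = lr_test([pred_a], [pred_a, pred_b])
print(f"A adds to B: chi2 = {lr_ab:.2f}, p = {p_ab:.4f}")
print(f"B adds to A: chi2 = {lr_ba:.2f}, p = {p_ba:.4f}")
```

If A adds significantly to B while B adds nothing to A, that is direct evidence that A is the better method, which also resolves the directionality question raised above.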