I am interested in comparing Classifier A with Classifier B. I have obtained Micro-Averaged F1 measures for Classifiers A and B that I intend to compare pairwise. I want to find out if Classifier A is better than B.
I am a little unclear on how to actually conduct the Wilcoxon signed-rank test. As far as I understand, the null hypothesis is that there is no significant difference between the classifiers, and the alternative hypothesis is that there is. Is this correct? If so, how do I in fact show that A is better than B? Even if I reject the null hypothesis, all I have shown is that there is a significant difference in classifier performance, not that A is better than B…
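One detail worth noting: "A is better than B" calls for a one-sided alternative, whereas the default in most software is two-sided. A minimal sketch with SciPy, assuming you have paired micro-averaged F1 scores from, say, ten folds or datasets (the numbers below are made up for illustration):

```python
from scipy.stats import wilcoxon

# hypothetical paired micro-averaged F1 scores for classifiers A and B
f1_a = [0.81, 0.79, 0.84, 0.78, 0.82, 0.80, 0.83, 0.77, 0.85, 0.79]
f1_b = [0.78, 0.77, 0.80, 0.78, 0.79, 0.78, 0.80, 0.76, 0.81, 0.77]

# H0: the paired differences are symmetric about zero
# H1 (one-sided): A's scores tend to be higher than B's
stat, p = wilcoxon(f1_a, f1_b, alternative="greater")
print(stat, p)
```

With `alternative="greater"`, a small p-value supports the directional claim that A outperforms B, rather than merely that the two differ. Zero differences (tied folds) are dropped by the default `zero_method="wilcox"`.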
Best Answer
There are many approaches, most of them not very powerful (e.g., comparing two ROC areas, i.e., c-indexes). Two powerful approaches, most easily done in an independent validation sample, are available once you make sure you get much more than information-losing "classifications" out of the "classifiers"; efficient approaches need, e.g., estimated probabilities of class membership. One of them is to use the R Hmisc package's rcorrp.cens function to test the null hypothesis that method A is no more concordant with the outcome than method B. This approach is more powerful than testing differences in ROC areas, and it works by forming all possible pairs of pairs of predictions.
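To make the pairs idea concrete, here is a toy Python sketch, not the actual rcorrp.cens U-statistic (which also handles censoring and accounts for the dependence between overlapping pairs): for every pair of observations with different outcomes, check which method ranks the pair correctly, then apply a naive sign test to the pairs on which exactly one method is concordant. The data and function name are illustrative only, and the sign test here ignores the dependence between pairs, so it shows the counting, not valid inference.

```python
from itertools import combinations
from scipy.stats import binomtest

def compare_concordance(p_a, p_b, y):
    """Count outcome-discordant pairs on which exactly one method is concordant."""
    a_only = b_only = 0
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue                      # pair carries no ranking information
        pos, neg = (i, j) if y[i] > y[j] else (j, i)
        a_ok = p_a[pos] > p_a[neg]        # method A ranks the positive case higher
        b_ok = p_b[pos] > p_b[neg]
        if a_ok and not b_ok:
            a_only += 1
        elif b_ok and not a_ok:
            b_only += 1
    n = a_only + b_only
    # H1: A is concordant on more of the disagreeing pairs than B
    p_value = binomtest(a_only, n, 0.5, alternative="greater").pvalue if n else 1.0
    return a_only, b_only, p_value

# made-up probabilities on which A discriminates the classes and B does not
y   = [0, 0, 1, 1]
p_a = [0.1, 0.2, 0.8, 0.9]
p_b = [0.6, 0.7, 0.3, 0.4]
a_only, b_only, p_value = compare_concordance(p_a, p_b, y)
print(a_only, b_only, p_value)
```

In R, the real thing would be a call along the lines of `rcorrp.cens(x1, x2, Surv(y))` from Hmisc, which also reports the estimated difference in concordance probabilities.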