I'm comparing the performance of two algorihtms $A$ and $B$ based on a metric.
For each algorithm, I have $31$ independent samples that represent its performance. This samples are grouped in two sets: $X$ and $Y$ (algoritmh $A$ and algorithm $B$, respectively).
For this task, I'm using a Wilcoxon rank-sum test with a level of significance $\alpha=0.05$ on R
I have selected this nonparametric test because it makes no assumption
about data distribution.
This is my methodology:
-
Hypothesis
Let $E(X)$ and $E(Y)$ be the means of $X$ and $Y$ respectively.
Then, the one-tailed test is defined as follows:-$H_0$: $E(X) = E(Y)$
(the performance of both algorithms is similar)-$H_1$: $E(X) > E(Y)$
(the performance of $A$ is better than $B$) -
Means
$E(X) = 36.87548$, $E(Y)=37.72585$
> summary(X)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 35.09 45.34 36.88 46.63 48.05> summary(Y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.332 34.860 41.120 37.730 42.510 48.110 -
Compare $A$ vs $B$
p = wilcox.test(X, Y, alternative="greater", conf.level=0.95)$p.value
If $p=0.0170 < \alpha = 0.05$, then $E(X)$ should be greater than $E(Y)$ (It's what I think, but it's not greater!), We can reject $H_0$, so we can conclude that: algorithm $A$ outperforms algorithm $B$ with a significance level of $\alpha=0.05$.
-
Compare $B$ vs $A$
p = wilcox.test(Y, X, alternative="greater", conf.level=0.95)$p.value
If $p=0.9835 > \alpha$, then $E(X)$ and $E(Y)$ are similar (Again, It's my guess). We can not reject $H_0$, therefore we can conclude that: $A$ and $B$ have similar performance.
Data
X = c( 45.51885768, 35.65081119, 44.60124311, 15.39979541, 48.05143243, 47.90604081,
7.58163868, 0.00000000, 40.94718019, 45.34194687, 28.55451125, 46.15113458,
48.03542321, 47.91413840, 45.38912357, 47.10730083, 47.42726563, 47.80316539,
34.51956662, 0.05853162, 45.29245167, 48.00199937, 45.28839538, 44.89017125,
45.47222435, 7.32111177, 43.35755055, 45.52413737, 45.53528261, 45.45233121,
3.04515436)
Y = c( 41.603296, 43.005620, 38.345339, 28.483733, 41.342548, 44.344933, 34.030309,
42.012604, 35.718175, 45.203532, 40.482022, 45.345594, 41.155187, 41.141522,
48.111677, 41.117034, 34.713158, 44.972073, 35.091889, 34.206018, 9.332199,
39.776291, 28.236449, 13.792789, 35.016681, 41.699073, 41.203090, 47.988765,
36.496740, 45.346652, 30.186485)
Questions
-
is The Wilcoxon rank sum test based on means (as I think in steps 3 and 4) or is based on medians?
-
Is it possible that $A$ can be better than $B$ and at the same time $E(X) < E(Y)$?
-
Is the following methodology correct?
for x in (X, Y) for y in (X, Y) if x == y: continue p = wilcox.test(x, y, alternative="greater", conf.level=0.95) if p < alpha: x is better than y else if p > alpha: p = wilcox.test(x, y, alternative="less", conf.level=0.95) if p < alpha: x is worse than y else x and y have similar performance
-
If I wish to test $A$ against several algorithms $B$, $C$, $…$, what could be the best approach to take?
Best Answer
This is not quite the case; it makes some assumptions (such as continuity), it just doesn't assume a specific functional form.
Neither. It's the median of pairwise differences (two sample Hodges-Lehmann difference) - that we're dealing with.
See this post for some discussion on that point (near the top of the post).
As whuber quite rightly points out below, under the location-shift alternative, it's a difference in means or medians as much as it is a median of pairwise differences.
See this post for a discussion of both the location-shift alternative and the more general alternative that the Wilcoxon-Mann-Whitney is sensitive to; there's some more discussion at the end of the post here
Certainly, if by 'better' you mean "has a high median pairwise difference".
Note that your density displays show a roughly similar asymmetric shape but quite different spread; that's one way (of a number of ways) you might see it. Different shapes but similar spread can also produce it. If there's only a shift in location, the difference in population means and population median-pairwise-difference will be the same - but even with a pure location-shift in the populations, the samples might show opposite shifts.
As expressed I don't understand it. For example, the comparison "if x==y" doesn't make sense - why would the samples be identical, and if they were, what would be the point in proceeding, since no test can find a difference?
What would be best depends on many things which I don't have the information to answer (if you want a nonparametric test I'd suggest considering permutation tests with good power against whatever alternative is of primary interest). The $k$-sample equivalent of the Wilcoxon-Mann-Whitney would be the Kruskal-Wallis test, so if you're happy with the WMW, you might consider the KW.