I have selected this nonparametric test because it makes no assumption about data distribution.
This is not quite the case; it makes some assumptions (such as continuity), it just doesn't assume a specific functional form.
is The Wilcoxon rank sum test based on means (as I think in steps 3 and 4) or is based on medians?
Neither. It's the median of pairwise differences (two sample Hodges-Lehmann difference) - that we're dealing with.
See this post for some discussion on that point (near the top of the post).
As whuber quite rightly points out below, under the location-shift alternative, it's a difference in means or medians as much as it is a median of pairwise differences.
See this post for a discussion of both the location-shift alternative and the more general alternative that the Wilcoxon-Mann-Whitney is sensitive to; there's some more discussion at the end of the post here
Is it possible that A can be better than B and at the same time $E(X)<E(Y)$?
Certainly, if by 'better' you mean "has a high median pairwise difference".
Note that your density displays show a roughly similar asymmetric shape but quite different spread; that's one way (of a number of ways) you might see it. Different shapes but similar spread can also produce it. If there's only a shift in location, the difference in population means and population median-pairwise-difference will be the same - but even with a pure location-shift in the populations, the samples might show opposite shifts.
Is the following methodology correct?
As expressed I don't understand it. For example, the comparison "if x==y" doesn't make sense - why would the samples be identical, and if they were, what would be the point in proceeding, since no test can find a difference?
If I wish to test A against several algorithms B, C, ..., what could be the best approach to take?
What would be best depends on many things which I don't have the information to answer (if you want a nonparametric test I'd suggest considering permutation tests with good power against whatever alternative is of primary interest). The $k$-sample equivalent of the Wilcoxon-Mann-Whitney would be the Kruskal-Wallis test, so if you're happy with the WMW, you might consider the KW.
Best Answer
Note that it's quite possible for two continuous distributions to yield rank-sums equal to their expected value under the null, or rank sums that differ by the smallest possible amount (1, as in this case). In either case all other arrangements would be "at least as extreme" in the two-tailed test, so the p-value would be 1.
Which is to say, you can quite easily get the p-value being exactly 1 without any of the values being the same as any other values.
For example, imagine we have the following 22 (combined & sorted) sample values:
Then if (for example) the two groups of 11 had the following items from that list:
(i.e. these now represent the ranks).
Which is to say the two groups have the following data:
Then the sum of ranks in the two groups differ only by 1 (and without ties it's not possible for them to differ by less), and the p-value must then be exactly 1:
Yet both the means and medians are different.
[There are many ways to split the values 1,2,...,22 up into two sets of 11 so that the sum of each set is either 126 or 127 -- i.e. 253/2 rounded up or down; this particular one just happened to be easy to obtain.]
Note that the Wilcoxon rank sum test is not a test of means nor a test of medians, and both may differ while the test sees the two samples as not different. Alternatively, you could be in a situations where you have both the means being the same, or both the medians being the same (even both means and medians equal across samples at the same time) while at the same time the Wilcoxon rank sum rejects the null (because it doesn't consider either of them).
(I'd regard the advice in comments of "try a t-test" to amount to p-hacking. I see no reason whatever to abandon the test you did.)