The paper you cite explains the method in the following terms:
> [...] we show the statistical significance of the difference between the
> performance of ESA-Wikipedia (March 26, 2006 version) and that of
> other algorithms by using Fisher's z-transformation (Press,
> Teukolsky, Vetterling, & Flannery, Numerical Recipes in C: The Art of
> Scientific Computing. Cambridge University Press, 1997, Section 14.5).
I suggest you follow that reference, or have a look at the Wikipedia page on the Spearman coefficient for details.
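
As a rough illustration of what such a comparison involves, here is a minimal Python sketch of a Fisher z-test for the difference between two correlation coefficients. It assumes the two coefficients come from independent samples and uses the standard error $\sqrt{1/(n_1-3) + 1/(n_2-3)}$. Some references use a slightly larger variance for Spearman coefficients, so treat this as an approximation rather than the paper's exact procedure. The function name `fisher_z_test` and the example numbers are hypothetical, not taken from the paper.

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test for the difference between two independent correlations.

    r1, r2: correlation coefficients; n1, n2: sample sizes (each > 3).
    Returns (z, p), where p is the two-sided p-value under a normal approximation.
    """
    # Fisher's z-transformation is the inverse hyperbolic tangent.
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    # Approximate standard error of the difference of the transformed values.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical numbers, for illustration only.
z, p = fisher_z_test(0.75, 100, 0.60, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A small p-value here would suggest that the two correlations differ by more than sampling noise alone would explain, under the stated independence assumption.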
It is informative to see exactly what the Mann-Whitney test does. For two samples $X = \{x_1, \dots, x_m \}$ and $Y=\{y_1, \dots, y_n\}$, under the assumptions that
- Observations in $X$ are iid
- Observations in $Y$ are iid
- The samples $X$ and $Y$ are mutually independent.
- The respective populations from which $X$ and $Y$ were sampled are continuous.
then, the U statistic is defined as:
$$ U = \sum_{i=1}^m \sum_{j=1}^n bool(x_i < y_j )$$
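
The definition above translates directly into a double loop. The following is a minimal Python sketch; the function name `u_statistic` and the toy samples are made up for illustration. Note that statistical packages may report the complementary count (pairs with $x_i > y_j$) or handle ties differently, so their value of $U$ need not match this one.

```python
def u_statistic(x, y):
    """Count the pairs (x_i, y_j) for which x_i < y_j."""
    return sum(1 for xi in x for yj in y if xi < yj)

# Toy samples, for illustration only.
x = [1.2, 3.4, 2.2]
y = [2.0, 4.1, 0.5, 3.9]

u = u_statistic(x, y)
expected_under_null = len(x) * len(y) / 2

print(u)                    # 7
print(expected_under_null)  # 6.0
```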
It should be reasonably intuitive to see that if $X$ and $Y$ come from the same distribution (i.e. the null hypothesis), then the expected value of $U$ would be $mn/2$, since you could expect values below a certain rank to occur as often for $X$ as for $Y$. So you can think of the Mann-Whitney test as checking to what extent the statistic $U$ deviates from this expected value.
If this intuition isn't clear, then think of the first rank (i.e. the leftmost, smallest value in each sample). If $X$ and $Y$ were drawn from the same distribution, you would have no reason to expect the smallest value in $X$ to be less than the smallest value in $Y$ more than 50% of the time; if it were, you would suspect that $X$ has a heavier left tail than $Y$. You can extend this logic to the 2nd smallest value, the 3rd, and so forth.
Similarly, if you drew the same number of observations, say $K$, from each population, you could almost think of the ranks as $K$ "common bins" with fuzzy boundaries. If $X$ and $Y$ came from the same population, you might expect each rank to occupy roughly the same space, and there's no reason to think that the $x_k$ observation in that bin would be to the right of $y_k$ more than 50% of the time.
However, if $x_k$ at a particular "bin" $k$ were to the right of $y_k$ more often than not, this would indicate a systematic "shift". This is what makes Mann-Whitney a good test for detecting a shift between distributions that are assumed to be relatively similar except for a possible shift due to a treatment effect.
Now consider the $X \sim \mathcal N(0,1)$ vs $Y \sim \mathcal N(0,2)$ scenario. Assume $K=1000$ observations in each sample. You would expect that, at the same rank, negative values in $Y$ tend to lie to the left of the corresponding values in $X$ more or less all the time, whereas positive values in $Y$ tend to lie to the right of those in $X$ more or less all the time. Therefore, in this particular scenario, even though the distributions are clearly different, $X$ tends to be larger than $Y$ in the lower half of the ranks and smaller than $Y$ in the upper half, and the two effects cancel out. You would therefore expect the U statistic to be very close to its expected value of $K^2/2$, and the test is unlikely to come out significant.
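
This scenario is easy to check by simulation. The sketch below is one possible way to do it, under a few assumptions that are not in the text above: the 2 in $\mathcal N(0,2)$ is read as a variance (standard deviation $\sqrt 2$), the seed is arbitrary, and a pure location shift of 0.5 is added only as a contrast. The z-score uses the usual large-sample null approximation with mean $mn/2$ and variance $mn(m+n+1)/12$, so it is indicative rather than exact when the variances differ.

```python
import bisect
import math
import random

def u_statistic(x, y):
    """Count pairs (x_i, y_j) with x_i < y_j, using sorting and binary search."""
    ys = sorted(y)
    n = len(ys)
    # bisect_right returns how many y values are <= x_i; the rest exceed x_i.
    return sum(n - bisect.bisect_right(ys, xi) for xi in x)

def z_score(u, m, n):
    """Standardise U with its mean and variance under the null hypothesis."""
    mean = m * n / 2
    sd = math.sqrt(m * n * (m + n + 1) / 12)
    return (u - mean) / sd

rng = random.Random(0)  # arbitrary seed, for reproducibility only
K = 1000

x = [rng.gauss(0, 1) for _ in range(K)]
y_scale = [rng.gauss(0, math.sqrt(2)) for _ in range(K)]  # same centre, wider spread
y_shift = [rng.gauss(0.5, 1) for _ in range(K)]           # same spread, shifted centre

u_scale = u_statistic(x, y_scale)
u_shift = u_statistic(x, y_shift)

print("expected under null:", K * K / 2)
print("scale difference: U =", u_scale, " z =", round(z_score(u_scale, K, K), 2))
print("location shift:   U =", u_shift, " z =", round(z_score(u_shift, K, K), 2))
```

With a setup like this, the scale-only comparison should give a $U$ near $K^2/2$, while the shifted sample should push $U$ well away from it.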
In other words, it may be a reasonable test for comparing two samples in a general "goodness of fit" sense in some specific circumstances, but it is important to be familiar with the situations where it would not be. The example above is one such case.
This kind of task is solved by ANCOVA (analysis of covariance). According to its model, weight depends on two effects (apart from the constant), the group effect and the covariate (height) effect: $weight = constant + group + height$. Here the group effect is clean, in the sense that any difference between the two groups in average height is washed out, so you may safely rely on the significance of the group effect. Before you run this analysis, you should make sure that the strength of the dependency of weight on height doesn't differ between the two groups. To check this, try the model with the interaction term: $weight = constant + group + height + group \times height$. If the interaction is nonsignificant, you can turn to the two-effect model above (whereas if it is significant, you should apply a nested model, which is a bit more complex).
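
Written out for observation $i$, the two models could be sketched as follows. The notation is introduced here for illustration only: $w_i$ is weight, $h_i$ is height, $g_i$ is a 0/1 group indicator, $\varepsilon_i$ is the error term, and $\beta_0, \dots, \beta_3$ are the coefficients to be estimated.

$$ w_i = \beta_0 + \beta_1 g_i + \beta_2 h_i + \varepsilon_i $$

$$ w_i = \beta_0 + \beta_1 g_i + \beta_2 h_i + \beta_3 g_i h_i + \varepsilon_i $$

In this notation, the preliminary check on the interaction is a test of $\beta_3 = 0$, and the group effect of interest in the first model is $\beta_1$, the expected difference in weight between the groups at the same height.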
The simple approach above, however, assumes that weight depends on height linearly, and we know that this is certainly not true. Another nasty thing is that weight depends on height heteroscedastically, that is, the variation of weight is larger for large heights than for small heights. What to do? One way is to transform weight prior to the analysis so that the weight-by-height scatter cloud is approximately linear and homoscedastic. Another option is to try a generalized linear model instead of classic ANCOVA. That procedure offers various link functions, which in effect perform the transformation for you implicitly.
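
To make the contrast concrete, here is one possible version of each option, assuming (purely for illustration) that a logarithm is the appropriate transformation. The notation is not from the original: $w_i$ is weight, $h_i$ is height, $g_i$ is a 0/1 group indicator, and $\varepsilon_i$ is the error term. Transforming the response first gives

$$ \log w_i = \beta_0 + \beta_1 g_i + \beta_2 h_i + \varepsilon_i $$

whereas a generalized linear model with a log link leaves the data untransformed and models the mean instead:

$$ \log \mathbb{E}[w_i] = \beta_0 + \beta_1 g_i + \beta_2 h_i $$

The two are not equivalent, since the first models the mean of $\log w_i$ and the second the log of the mean of $w_i$. Which transformation or link is suitable should be judged from the data, for example from residual plots.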