Solved – Large difference between Mann-Whitney test and Wilcoxon signed rank test significance

I am looking at the success of textual requests. We have a dataset of matched pairs, one successful and one unsuccessful, matched on the length of the request in words (since I want to eliminate this effect). I want to determine which features are significant for success.

For some features, e.g. the number of posts by the same user before this request, I get very different significance results from a Mann-Whitney U test and a Wilcoxon signed-rank test (I am using Python/SciPy.stats for this):

Mann-Whitney U: p = 5.91e-6 (one-sided)

Wilcoxon signed-rank: p = 1.4e-2 (two-sided)

Why is that? I am not a statistician but I am surprised by this result.
"Mann-Whitney U and Wilcoxon Matched pairs are basically the same in that they compare between two medians to suggest whether both samples come from the same population or not." from http://www.le.ac.uk/bl/gat/virtualfc/Stats/nonpcom.html

What assumptions am I missing, or what explains this gap?

Best Answer

In scipy.stats, the Mann-Whitney U test compares two populations:

Computes the Mann-Whitney rank test on samples x and y.

but the Wilcoxon test compares two PAIRED populations:

The Wilcoxon signed-rank test tests the null hypothesis that two related paired samples come from the same distribution. In particular, it tests whether the distribution of the differences x - y is symmetric about zero. It is a non-parametric version of the paired T-test.

EDITED / CORRECTED in response to ttnphns' comments.

Note that the paired t-test does not test whether the distribution of the differences is symmetric about zero, so the Wilcoxon signed-rank test is not truly a non-parametric counterpart of the paired t-test.

The Mann-Whitney test, on the other hand, assumes that all the observations are independent of each other (no basis for pairing here!). Its null hypothesis is that the two distributions are the same, and the alternative is that one is stochastically greater than the other. If we make the additional assumption that the only difference between the two distributions is their location, and that the distributions are continuous, then "stochastically greater than" is equivalent to statements such as "the medians are different", so with the extra assumption(s) you can interpret it that way.
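To make the distinction concrete, here is a minimal pure-Python sketch of the two test statistics (not the SciPy implementation, and with no p-values). The data are made up for illustration, and the helper names `midranks`, `mann_whitney_u` and `signed_rank_sums` are hypothetical. It shows that the Mann-Whitney U statistic depends only on the pooled ranks, so re-ordering one sample leaves it unchanged, whereas the signed-rank sums depend entirely on which observation is paired with which.

```python
def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic for x: rank sum of x in the pooled sample minus n_x(n_x+1)/2."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


def signed_rank_sums(x, y):
    """(W+, W-): rank sums of the positive and negative paired differences.

    Zero differences are dropped before ranking.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = midranks([abs(d) for d in diffs])
    w_plus = sum((r for r, d in zip(ranks, diffs) if d > 0), 0.0)
    w_minus = sum((r for r, d in zip(ranks, diffs) if d < 0), 0.0)
    return w_plus, w_minus


# Made-up data: each "successful" value is slightly above its matched partner,
# but the variation between pairs is much larger than the variation within pairs.
successful = [12, 30, 7, 45, 22, 60]
unsuccessful = [10, 27, 5, 41, 20, 55]
reordered = list(reversed(unsuccessful))  # same values, pairing destroyed

print(mann_whitney_u(successful, unsuccessful))    # 21.0
print(mann_whitney_u(successful, reordered))       # 21.0 (pairing is ignored)
print(signed_rank_sums(successful, unsuccessful))  # (21.0, 0.0)
print(signed_rank_sums(successful, reordered))     # (12.0, 9.0)
```

With the original pairing every difference is positive, so W+ takes its maximum value of 21, while U = 21 is close to its null expectation of 18 (out of a maximum of 36). The two tests are using different information from the same numbers, so their p-values have no reason to agree.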

In scipy.stats, the Mann-Whitney test uses a continuity correction by default, but the Wilcoxon test doesn't.

The Mann-Whitney test handles ties using the midrank, but the Wilcoxon test offers three options for handling ties in the paired values (i.e., zero difference between the two elements of the pair).
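As a rough sketch of where these options live in `scipy.stats`, the snippet below calls both tests with the relevant keyword arguments spelled out. The data are made up for illustration (two pairs have zero difference), and the exact defaults and available keywords can vary between SciPy versions, so treat this as an assumption to check against your installed version. Note also that the p-values quoted in the question compare a one-sided result with a two-sided one; passing `alternative` explicitly to both tests removes that source of discrepancy.

```python
from scipy import stats

# Made-up paired counts, for illustration only; two pairs have zero difference.
successful = [12, 30, 7, 45, 22, 60, 18, 33, 9, 27, 41, 15]
unsuccessful = [10, 27, 7, 41, 20, 55, 18, 30, 11, 25, 36, 14]

# Mann-Whitney U: treats the two samples as independent (pairing is ignored).
# use_continuity=True is the documented default; it is written out for clarity.
mwu_one = stats.mannwhitneyu(
    successful, unsuccessful, use_continuity=True, alternative="greater"
)
mwu_two = stats.mannwhitneyu(
    successful, unsuccessful, use_continuity=True, alternative="two-sided"
)
print("Mann-Whitney one-sided p:", mwu_one.pvalue)
print("Mann-Whitney two-sided p:", mwu_two.pvalue)

# Wilcoxon signed-rank: works on the paired differences.
# zero_method selects how zero differences are handled; correction=False
# (no continuity correction) is the documented default.
wilcoxon_results = {}
for zero_method in ("wilcox", "pratt", "zsplit"):
    wilcoxon_results[zero_method] = stats.wilcoxon(
        successful,
        unsuccessful,
        zero_method=zero_method,
        correction=False,
        alternative="two-sided",
    )
    print("Wilcoxon", zero_method, "p:", wilcoxon_results[zero_method].pvalue)
```

Comparing the two tests with the same `alternative` and the same continuity-correction setting isolates the difference that comes from pairing itself.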

It sounds like the Wilcoxon test is the more appropriate for your purposes, since you do have that lack of independence between all observations. However, one might imagine that requests with similar, but not equal, lengths might exhibit similar behavior, whereas the Wilcoxon would assume that if they aren't paired, they are independent. A logistic regression model might serve you better in this case.

Quotes are from the scipy.stats doc pages, which we aren't supposed to link to, apparently.