Solved – Large difference between Mann-Whitney test and Wilcoxon signed rank test significance

wilcoxon-mann-whitney-test, wilcoxon-signed-rank

I am looking at the success of textual requests. We have a dataset of matched pairs, one successful and one unsuccessful, matched on the length of the request in words (since I want to eliminate this effect). I want to determine which features are significant for success.

For some features, e.g. the number of posts by the same user before this request, I get very different significance results from a Mann-Whitney U test and a Wilcoxon signed-rank test (I am using Python/SciPy.stats for this):

Mann Whitney U, 5.91e-6 (one sided)

Wilcoxon signed Rank, 1.4e-2 (two sided)

Why is that? I am not a statistician but I am surprised by this result.
"Mann-Whitney U and Wilcoxon Matched pairs are basically the same in that they compare between two medians to suggest whether both samples come from the same population or not." from http://www.le.ac.uk/bl/gat/virtualfc/Stats/nonpcom.html

What assumptions am I missing, or what explains this gap?

Best Answer

In scipy.stats, the Mann-Whitney U test compares two populations:

Computes the Mann-Whitney rank test on samples x and y.

but the Wilcoxon test compares two PAIRED populations:

The Wilcoxon signed-rank test tests the null hypothesis that two related paired samples come from the same distribution. In particular, it tests whether the distribution of the differences x - y is symmetric about zero. It is a non-parametric version of the paired T-test.

EDITED / CORRECTED in response to ttnphns' comments.

Note that the paired t test does not test whether the distribution of the differences is symmetric about zero, so the Wilcoxon signed-rank test is not truly a non-parametric counterpart of the paired t test.

The Mann-Whitney test, on the other hand, assumes that all the observations are independent of each other (no basis for pairing here!). Its null hypothesis is that the two distributions are the same, and the alternative is that one is stochastically greater than the other. If we make the additional assumption that the only difference between the two distributions is their location, and the distributions are continuous, then "stochastically greater than" is equivalent to such statements as "the medians are different", so you can, with the extra assumption(s), interpret it that way.

In scipy.stats, the Mann-Whitney test uses a continuity correction by default (use_continuity=True), but the Wilcoxon test doesn't (correction=False).

The Mann-Whitney test handles ties using the midrank, but the Wilcoxon test offers three options (zero_method) for handling ties in the paired values (i.e., zero difference between the two elements of the pair).
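To make the paired/unpaired distinction concrete, here is a minimal sketch that runs both scipy.stats tests on the same data. The data are synthetic and purely an assumption for illustration (they are not the asker's requests): each pair shares a large common component, and one member is shifted up slightly. Both calls use the same two-sided alternative so the p-values are comparable. On other data the gap can go in the opposite direction, as it did in the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic matched pairs (an assumption, not the data from the question).
# Both members of a pair share a large common component; x is shifted up by 1.
n_pairs = 50
shared = rng.normal(loc=0.0, scale=10.0, size=n_pairs)
x = shared + 1.0 + rng.normal(scale=0.5, size=n_pairs)
y = shared + rng.normal(scale=0.5, size=n_pairs)

# Unpaired test: treats all 2 * n_pairs observations as independent.
u_stat, p_unpaired = stats.mannwhitneyu(
    x, y, use_continuity=True, alternative="two-sided"
)

# Paired test: works on the within-pair differences x - y.
w_stat, p_paired = stats.wilcoxon(
    x, y, zero_method="wilcox", correction=False, alternative="two-sided"
)

print(f"Mann-Whitney U (unpaired):  p = {p_unpaired:.3g}")
print(f"Wilcoxon signed-rank (paired): p = {p_paired:.3g}")
```

In this setup the shared component swamps the shift when the pairing is ignored, while the within-pair differences isolate it. Also check that both p-values use the same sidedness before comparing them: the figures in the question mix a one-sided and a two-sided result.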

It sounds like the Wilcoxon test is the more appropriate for your purposes, since you do have that lack of independence between all observations. However, one might imagine that requests with similar, but not equal, lengths might exhibit similar behavior, whereas the Wilcoxon would assume that if they aren't paired, they are independent. A logistic regression model might serve you better in this case.

Quotes are from the scipy.stats doc pages, which we aren't supposed to link to, apparently.

Related Solutions

Solved – Mann-Whitney U test and paired data

(I'm not sure I really follow your reasoning.) The Mann-Whitney U-test can be used with paired data. It will simply be less powerful. When you ignore the pairing, you are throwing a lot of information away.
I don't really understand this question.
The meaning of p-values here is the same as the meaning of p-values anywhere in frequentist statistics. That is, it is the probability of finding data as far or further from the null value if the null hypothesis is true. It may help you to read this CV thread: What is the meaning of p values and t values in statistical tests?

Solved – Critical value for Wilcoxon one-sample signed-rank test in R

You can use the qsignrank() function. Example:

> qsignrank(.025, 10, lower.tail=FALSE)
[1] 46

This means that for a sample size of 10 and a two-sided test with a significance level of 5%, the test statistic must be greater than 46 (i.e., 47 or greater) to be statistically significant. Example data:

> set.seed(1)
> x = rnorm(10, .5)
> wilcox.test(x)

    Wilcoxon signed rank test

data:  x
V = 47, p-value = 0.04883
alternative hypothesis: true location is not equal to 0

Here the test statistic is 47, and significant at the 5% level.

Note that for a two-sided test, the critical value returned by qsignrank() with lower.tail=FALSE applies to the larger of the two possible test statistics. For example, wilcox.test(-x) gives a test statistic of 8, which can be transformed into 47 by $\frac{10\cdot 11}{2}-8$.
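For readers working in Python rather than R, the same numbers can be reproduced from the exact null distribution of the signed-rank statistic. The sketch below is not R's implementation of qsignrank(); the function names are hypothetical, and it assumes no ties and no zero differences, so that under the null each rank 1..n independently carries a positive sign with probability 1/2.

```python
def signed_rank_counts(n):
    """counts[v] = number of sign assignments with signed-rank statistic V = v."""
    max_v = n * (n + 1) // 2
    counts = [0] * (max_v + 1)
    counts[0] = 1  # all signs negative gives V = 0
    for rank in range(1, n + 1):
        # Go downward so each rank is added to a subset at most once.
        for v in range(max_v, rank - 1, -1):
            counts[v] += counts[v - rank]
    return counts


def upper_tail(n, v):
    """P(V >= v) under the null hypothesis."""
    return sum(signed_rank_counts(n)[v:]) / 2 ** n


def upper_critical_value(n, alpha):
    """Smallest c with P(V > c) <= alpha (same convention as the R call above)."""
    max_v = n * (n + 1) // 2
    for c in range(max_v + 1):
        if upper_tail(n, c + 1) <= alpha:
            return c
    return max_v


def two_sided_p(n, v):
    """Two-sided exact p-value for an observed statistic v in the upper tail."""
    return min(1.0, 2 * upper_tail(n, v))


n = 10
print(upper_critical_value(n, 0.025))   # 46
print(round(two_sided_p(n, 47), 5))     # 0.04883
print(n * (n + 1) // 2 - 8)             # 47, the mirror image of V = 8
```

The counting step is a subset-sum recurrence over the ranks, which is enough for small n; for large samples a normal approximation is the usual substitute.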
