In scipy.stats, the Mann-Whitney U test compares two populations:
"Computes the Mann-Whitney rank test on samples x and y."
but the Wilcoxon test compares two PAIRED populations:
"The Wilcoxon signed-rank test tests the null hypothesis that two
related paired samples come from the same distribution. In particular,
it tests whether the distribution of the differences x - y is
symmetric about zero. It is a non-parametric version of the paired
T-test."
EDITED / CORRECTED in response to ttnphns' comments.
Note that the paired t-test does not test whether the distribution of the differences is symmetric about zero, so the Wilcoxon signed-rank test is not truly a non-parametric counterpart of the paired t-test.
The Mann-Whitney test, on the other hand, assumes that all the observations are independent of each other (no basis for pairing here!). Its null hypothesis is that the two distributions are the same, and the alternative is that one is stochastically greater than the other. If we make the additional assumptions that the only difference between the two distributions is their location and that the distributions are continuous, then "stochastically greater than" is equivalent to statements such as "the medians are different", so with those extra assumptions you can interpret it that way.
In scipy, the Mann-Whitney implementation uses a continuity correction by default, but the Wilcoxon implementation doesn't.
The Mann-Whitney handles ties using the midrank, but the Wilcoxon offers three options for handling ties in the paired values (i.e., a zero difference between the two elements of a pair).
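To make those defaults concrete, here is a minimal sketch that calls both scipy functions with the relevant arguments spelled out. The data are made up purely for illustration (the arrays `before` and `after` are hypothetical), and one zero difference is forced so that the three `zero_method` choices actually matter.

```python
import numpy as np
from scipy import stats

# Hypothetical paired data, for illustration only.
rng = np.random.default_rng(0)
before = rng.normal(10.0, 2.0, size=20)
after = before + rng.normal(0.5, 1.0, size=20)
after[0] = before[0]  # force one zero difference within a pair

# Mann-Whitney: treats the two samples as independent.
# use_continuity=True is the default; it is written out here for clarity.
mw = stats.mannwhitneyu(before, after, use_continuity=True,
                        alternative="two-sided")

# Wilcoxon signed-rank: works on the paired differences.
# correction=False is the default; zero_method picks how zero
# differences are treated.
wilcoxon_p = {}
for zero_method in ("wilcox", "pratt", "zsplit"):
    res = stats.wilcoxon(before, after, zero_method=zero_method,
                         correction=False, alternative="two-sided")
    wilcoxon_p[zero_method] = res.pvalue

print("Mann-Whitney p-value:", mw.pvalue)
print("Wilcoxon p-values by zero_method:", wilcoxon_p)
```

Because the two tests answer different questions, the p-values are not expected to agree; the point is only to show where the continuity correction and the zero-handling option enter each call.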
It sounds like the Wilcoxon test is the more appropriate for your purposes, since you do have that lack of independence between all observations. However, one might imagine that requests with similar, but not equal, lengths might exhibit similar behavior, whereas the Wilcoxon would assume that if they aren't paired, they are independent. A logistic regression model might serve you better in this case.
Quotes are from the scipy.stats doc pages, which we aren't supposed to link to, apparently.
The Wilcoxon signed rank test has a null distribution that rapidly approaches a normal distribution.
The tables tend to stop by n=50 because the normal approximation is excellent well before that point. Indeed, there's probably little point tabulating much beyond n=20. The normal approximation is given at the Wikipedia page for the test -- but you need to make sure you're using the same version of the statistic (there's more than one definition going around; they should all give the same p-values though). Wikipedia's version uses the sum of all the signed ranks.
If you use R (or a number of other statistical packages), they'll happily give critical values for one and two tailed tests. Again, you have to make sure you're using the same definition of the statistic as they do (R uses "the sum of the positive ranks" as the statistic).
Using R's definition of the statistic, at n=63, the 5% two tailed critical value is 1294; the 5% (upper) one tailed critical value is 1248.
Using the corresponding normal approximation (with or without continuity correction) gives the same values.
To get a p-value using a normal approximation you need the mean and standard deviation of the particular statistic you're using, when $H_0$ is true. You can (for example) then compute a standardized version of the test statistic (which is approximately normally distributed) if you wish - though with computer packages you can avoid the need to standardize.
You can then use normal tables or computer functions for the normal distribution to obtain a p-value, or you can simply compare your statistic with critical values for your significance level.
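As a rough sketch of that recipe, the snippet below uses the standard null moments of the sum of the positive ranks with no ties or zero differences: mean $n(n+1)/4$ and variance $n(n+1)(2n+1)/24$. The function names are hypothetical, and the critical value is reported under the assumed convention that the statistic must exceed it to be significant.

```python
from math import floor, sqrt
from statistics import NormalDist


def signed_rank_moments(n):
    """Null mean and SD of V = sum of positive ranks (no ties, no zeros)."""
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return mean, sd


def upper_critical_value(n, alpha, two_sided=True):
    """Largest V that is not significant under the normal approximation."""
    mean, sd = signed_rank_moments(n)
    tail = alpha / 2 if two_sided else alpha
    z = NormalDist().inv_cdf(1 - tail)
    return floor(mean + z * sd)


def upper_p_value(v, n):
    """One-sided (upper tail) p-value with a continuity correction."""
    mean, sd = signed_rank_moments(n)
    return 1 - NormalDist().cdf((v - 0.5 - mean) / sd)


two_sided_cv = upper_critical_value(63, 0.05)
one_sided_cv = upper_critical_value(63, 0.05, two_sided=False)
print(two_sided_cv, one_sided_cv)
```

For n = 63 this gives 1294 (two-tailed) and 1248 (upper one-tailed), matching the values quoted above. For a two-sided p-value, double the smaller of the two tail probabilities.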
You can use R's qsignrank() function. For a sample size of 10 and a two-sided test with a significance level of 5%, the value it returns is 46, so the test statistic must be greater than 46 (i.e., 47 or greater) to be statistically significant.
For the example data, the test statistic is 47, which is significant at the 5% level.
Note that for a two-sided test, the critical value from qsignrank() applies to the larger of the two possible test statistics. For example, wilcox.test(-x) gives a test statistic of 8, which can be transformed into 47 by $\frac{10\cdot 11}{2}-8 = 47$.
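For readers working in Python rather than R, the quantile can be reproduced from the exact null distribution. This is only a sketch: `signrank_quantile` is a hypothetical helper, not a scipy or R function, and it assumes no ties and no zero differences, so that every subset of the ranks 1..n is equally likely to be the set of positive ranks.

```python
def signrank_counts(n):
    """counts[v] = number of subsets of {1, ..., n} whose ranks sum to v."""
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Go downwards so each rank is used at most once per subset.
        for total in range(max_sum, rank - 1, -1):
            counts[total] += counts[total - rank]
    return counts


def signrank_quantile(p, n):
    """Smallest v with P(V <= v) >= p under the null hypothesis."""
    counts = signrank_counts(n)
    threshold = p * 2 ** n  # each of the 2**n subsets is equally likely
    cumulative = 0
    for v, count in enumerate(counts):
        cumulative += count
        if cumulative >= threshold:
            return v
    return len(counts) - 1


n = 10
critical = signrank_quantile(0.975, n)  # two-sided 5% level
max_stat = n * (n + 1) // 2             # sum of all ranks
mirrored = max_stat - 8                 # the other of the two statistics
print(critical, mirrored)
```

With n = 10 the quantile is 46, and mirroring a statistic of 8 around the rank total of 55 gives 47, which exceeds it.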