The Note in the help on the wilcox.test function clearly explains why R's value is smaller than yours:
Note
The literature is not unanimous about the definitions of the Wilcoxon rank sum and Mann-Whitney tests. The two most common definitions correspond to the sum of the ranks of the first sample with the minimum value subtracted or not: R subtracts and S-PLUS does not, giving a value which is larger by m(m+1)/2 for a first sample of size m. (It seems Wilcoxon's original paper used the unadjusted sum of the ranks but subsequent tables subtracted the minimum.)
That is, the definition R uses is $n_1(n_1+1)/2$ smaller than the version you use, where $n_1$ is the number of observations in the first sample.
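To make that offset concrete, here is a minimal sketch in plain Python (standard library only; it is not R's or scipy's implementation). The helper names midranks and rank_sum_statistics and the data are made up for illustration. It computes the unadjusted rank sum of the first sample and the version with the minimum $n_1(n_1+1)/2$ subtracted.

```python
def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_statistics(x, y):
    """Return (unadjusted rank sum of x, rank sum minus n1*(n1+1)/2)."""
    n1 = len(x)
    ranks = midranks(list(x) + list(y))  # rank the pooled sample
    rank_sum = sum(ranks[:n1])           # ranks belonging to the first sample
    return rank_sum, rank_sum - n1 * (n1 + 1) / 2


# Hypothetical data, chosen only for illustration.
x = [1.1, 2.4, 3.0, 4.8]
y = [0.5, 2.0, 3.5, 5.2, 6.1]
t_w, w = rank_sum_statistics(x, y)
print(t_w, w)  # the two values differ by n1*(n1+1)/2 = 10
```

The second value is the convention the Note attributes to R; the first is the unadjusted sum of ranks.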
As for modifying the result, you could assign the output from wilcox.test into a variable, say a, and then manipulate a$statistic - adding the minimum to its value and changing its name. Then when you print a (e.g. by typing a), it will look the way you want.
To see what I am getting at, try this:
a <- wilcox.test(x,y,correct=FALSE)
str(a)
So for example if you do this:
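n1 <- length(x)                            # size of the first sample
a$statistic <- a$statistic + n1*(n1+1)/2   # add back the minimum rank sum
names(a$statistic) <- "T.W"                # relabel the statistic
a                                          # print the modified result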
then you get:
Wilcoxon rank sum test with continuity correction
data: x and y
T.W = 156.5, p-value = 0.006768
alternative hypothesis: true location shift is not equal to 0
It's quite common to refer to the rank sum statistic (whether shifted by $n_1(n_1+1)/2$ or not) as either $W$ or $w$ or some close variant (e.g. here or here). It also often gets called '$U$' because of Mann & Whitney. There's plenty of precedent for using $W$, so for myself I wouldn't bother with the line that changes the name of the statistic, but if it suits you to do so there's no reason why you shouldn't.
In scipy.stats, the Mann-Whitney U test compares two populations:
Computes the Mann-Whitney rank test on samples x and y.
but the Wilcoxon test compares two PAIRED populations:
The Wilcoxon signed-rank test tests the null hypothesis that two
related paired samples come from the same distribution. In particular,
it tests whether the distribution of the differences x - y is
symmetric about zero. It is a non-parametric version of the paired
T-test.
EDITED / CORRECTED in response to ttnphns' comments.
Note that the paired t test does not test whether the distribution of the differences is symmetric about zero, so the Wilcoxon signed rank test is not truly a non-parametric counterpart of the paired t test.
The Mann-Whitney test, on the other hand, assumes that all the observations are independent of each other (no basis for pairing here!). Its null hypothesis is that the two distributions are the same, and the alternative is that one is stochastically greater than the other. If we make the additional assumption that the only difference between the two distributions is their location, and the distributions are continuous, then "stochastically greater than" is equivalent to such statements as "the medians are different", so you can, with the extra assumption(s), interpret it that way.
The Mann-Whitney uses a continuity correction by default, but the Wilcoxon doesn't.
The Mann-Whitney handles ties using the midrank, but the Wilcoxon offers three options for handling ties in the paired values (i.e., zero difference between the two elements of the pair).
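As a rough illustration of those three options, here is a sketch in plain Python (standard library only). It is not scipy's code. The option names "wilcox", "pratt" and "zsplit" are the values scipy.stats.wilcoxon accepts for zero_method; the function names and the data are hypothetical. "wilcox" drops zero differences before ranking, "pratt" ranks them but leaves their ranks out of both sums, and "zsplit" ranks them and splits their ranks evenly between the positive and negative sums.

```python
def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def signed_rank_sums(x, y, zero_method="wilcox"):
    """Return (sum of ranks of positive differences, sum for negative ones)."""
    if zero_method not in ("wilcox", "pratt", "zsplit"):
        raise ValueError("unknown zero_method: %r" % zero_method)
    d = [a - b for a, b in zip(x, y)]
    if zero_method == "wilcox":
        d = [v for v in d if v != 0]        # discard zero differences first
    ranks = midranks([abs(v) for v in d])   # rank the absolute differences
    r_plus = sum(r for v, r in zip(d, ranks) if v > 0)
    r_minus = sum(r for v, r in zip(d, ranks) if v < 0)
    if zero_method == "zsplit":
        # Zeros were ranked; give half of their ranks to each side.
        r_zero = sum(r for v, r in zip(d, ranks) if v == 0)
        r_plus += r_zero / 2
        r_minus += r_zero / 2
    # For "pratt" the zeros were ranked but their ranks are simply left out.
    return r_plus, r_minus


# Hypothetical paired data with two zero differences.
x = [10, 12, 9, 15, 11]
y = [8, 12, 10, 11, 11]
for method in ("wilcox", "pratt", "zsplit"):
    print(method, signed_rank_sums(x, y, method))
```

With no zero differences in the data, all three options give the same sums.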
It sounds like the Wilcoxon test is the more appropriate for your purposes, since you do have that lack of independence between all observations. However, one might imagine that requests with similar, but not equal, lengths might exhibit similar behavior, whereas the Wilcoxon would assume that if they aren't paired, they are independent. A logistic regression model might serve you better in this case.
Quotes are from the scipy.stats doc pages, which we aren't supposed to link to, apparently.
Best Answer
You should use the signed rank test when the data are paired.
You'll find many definitions of pairing, but at heart the criterion is something that makes pairs of values at least somewhat positively dependent, while unpaired values are not dependent. Often the dependence-pairing occurs because they're observations on the same unit (repeated measures), but it doesn't have to be on the same unit, just in some way tending to be associated (while measuring the same kind of thing), to be considered as 'paired'.
You should use the rank-sum test when the data are not paired.
That's basically all there is to it.
Note that having the same $n$ doesn't mean the data are paired, and having different $n$ doesn't mean that there isn't pairing (it may be that a few pairs lost an observation for some reason). Pairing comes from consideration of what was sampled.
The effect of using a paired test when the data are paired is that it generally gives more power to detect the changes you're interested in. If the association leads to strong dependence*, then the gain in power may be substantial.
* specifically, but speaking somewhat loosely, if the effect size is large compared to the typical size of the pair-differences, but small compared to the typical size of the unpaired-differences, you may pick up the difference with a paired test at a quite small sample size but with an unpaired test only at a much larger sample size.
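A small made-up example of the situation in that footnote, sketched in plain Python (standard library only). The data are hypothetical: eight units whose baseline values differ a lot from each other, each with a small but consistent increase on the second measurement. Both functions compute exact two-sided p-values by brute-force enumeration and assume no ties and no zero differences; the function names are mine, not from any library.

```python
from itertools import combinations, product


def signed_rank_exact_p(x, y):
    """Exact two-sided signed-rank p-value (assumes no ties, no zeros)."""
    d = [b - a for a, b in zip(x, y)]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    rank = {i: r + 1 for r, i in enumerate(order)}
    observed = sum(rank[i] for i in range(n) if d[i] > 0)
    center = n * (n + 1) / 4
    # Under the null, each rank independently gets a + or - sign.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        stat = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if abs(stat - center) >= abs(observed - center):
            extreme += 1
    return extreme / 2 ** n


def rank_sum_exact_p(x, y):
    """Exact two-sided rank-sum p-value (assumes no ties)."""
    pooled = list(x) + list(y)
    n, n1 = len(pooled), len(x)
    order = sorted(range(n), key=lambda i: pooled[i])
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r + 1
    observed = sum(rank[:n1])
    center = n1 * (n + 1) / 2
    # Under the null, every choice of n1 ranks for x is equally likely.
    extreme = total = 0
    for combo in combinations(range(1, n + 1), n1):
        total += 1
        if abs(sum(combo) - center) >= abs(observed - center):
            extreme += 1
    return extreme / total


# Hypothetical data: large spread between units, small consistent shift.
before = [10, 20, 30, 40, 50, 60, 70, 80]
after = [11, 22, 33, 44, 55, 66, 77, 88]
p_paired = signed_rank_exact_p(before, after)
p_unpaired = rank_sum_exact_p(before, after)
print(p_paired, p_unpaired)
```

Here every pair-difference is positive, so the paired test gives the smallest p-value possible with eight pairs, while the unpaired test sees two heavily overlapping samples.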
However, when the data are not paired, it may be (at least slightly) counterproductive to treat the data as paired. That said, the cost - in lost power - may in many circumstances be quite small - a power study I did in response to this question seems to suggest that on average the power loss in typical small-sample situations (say for n of the order of 10 to 30 in each sample, after adjusting for differences in significance level) may be surprisingly small.
[If you're somehow really uncertain whether the data are paired or not, the loss in treating unpaired data as paired is usually relatively minor, while the gains may be substantial if they are paired. This suggests that if you really don't know, and have a way of figuring out what is paired with what assuming they were paired -- such as the values being in the same row in a table -- it may in practice make sense to act as if the data were paired to be safe -- though some people may tend to get quite exercised over you doing that.]