In scipy.stats, the Mann-Whitney U test compares two populations:
> Computes the Mann-Whitney rank test on samples x and y.
but the Wilcoxon test compares two PAIRED populations:
> The Wilcoxon signed-rank test tests the null hypothesis that two
> related paired samples come from the same distribution. In particular,
> it tests whether the distribution of the differences x - y is
> symmetric about zero. It is a non-parametric version of the paired
> T-test.
Note that the paired t-test does not test whether the distribution of the differences is symmetric about zero, so the Wilcoxon signed-rank test is not strictly a non-parametric counterpart of the paired t-test.
The Mann-Whitney test, on the other hand, assumes that all the observations are independent of each other (no basis for pairing here!). It also assumes that the two distributions are the same, and the alternative is that one is stochastically greater than the other. If we make the additional assumption that the only difference between the two distributions is their location, and the distributions are continuous, then "stochastically greater than" is equivalent to such statements as "the medians are different", so you can, with the extra assumption(s), interpret it that way.
The Mann-Whitney uses a continuity correction by default, but the Wilcoxon doesn't.
The Mann-Whitney test handles ties using midranks, while the Wilcoxon test offers three options for handling ties in the paired values (i.e., a zero difference between the two elements of a pair).
It sounds like the Wilcoxon test is the more appropriate choice for your purposes, since you do have that lack of independence between observations. However, one might imagine that requests with similar, but not equal, lengths exhibit similar behavior, whereas the Wilcoxon test treats observations that aren't paired as independent. A logistic regression model might serve you better in that case.
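To make the contrast concrete, here is a minimal sketch running both tests on the same synthetic paired data with `scipy.stats` (assuming scipy >= 1.7; keyword defaults have varied across versions — this is an illustration, not the asker's actual data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=30)
y = x + rng.normal(loc=0.5, scale=0.5, size=30)  # paired with a shift

# Mann-Whitney: treats x and y as independent samples; pairing is ignored.
u_stat, u_p = stats.mannwhitneyu(x, y, alternative="two-sided")

# Wilcoxon: uses the pairing; zero_method selects how zero differences
# (ties within a pair) are handled.
w_stat, w_p = stats.wilcoxon(x, y, zero_method="wilcox")

print("Mann-Whitney:", u_stat, u_p)
print("Wilcoxon:    ", w_stat, w_p)
```

Because the simulated shift is built into the pairing, the Wilcoxon test typically detects it with a much smaller p-value than the Mann-Whitney test on the same data.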
Quotes are from the scipy.stats documentation pages.
This is not a problem of the t-test specifically, but of any test whose power depends on the sample size: with a large enough sample, even a tiny, practically irrelevant difference becomes statistically significant. This is sometimes called an "overpowered" test. And yes, switching to the Mann-Whitney test will not help.
Therefore, apart from asking whether the results are statistically significant, you need to ask whether the observed effect size is significant in the everyday sense of the word (i.e., meaningful). Answering that requires not only statistical knowledge but also expertise in the field you are investigating.
In general, there are two ways you can look at effect size. One way is to scale the difference between the means by the standard deviation of the data. Since the standard deviation is in the same units as the means and describes the dispersion of the data, you can express the difference between your groups in units of standard deviation. Importantly, the estimated standard deviation of the data does not systematically decrease with the number of samples (unlike the standard error of the mean).
This is, for example, the reasoning behind Cohen's $d$:
$$d = \frac{ \bar{x}_1 - \bar{x}_2 }{ s}$$
...where $s$ is the square root of the pooled variance.
$$s = \sqrt{\frac{ s_1^2\cdot(n_1-1) + s_2^2\cdot(n_2 - 1) }{ N - 2 } }$$
(where $N=n_1+n_2$ and $s_1$ and $s_2$ are the standard deviations in groups 1 and 2, respectively; that is, $s_1 = \sqrt{ \frac{\sum(x_i-\bar{x}_1)^2 }{n_1 -1 }} $).
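The formulas above translate directly into code. A minimal sketch (the `cohens_d` helper is my own naming, not a scipy function; `ddof=1` gives the $n-1$ denominator used in the definition of $s_1$):

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d with the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = np.std(x1, ddof=1), np.std(x2, ddof=1)
    # Pooled standard deviation, as in the formula above.
    s = np.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / s

x1 = np.array([4.0, 5.0, 6.0, 7.0])
x2 = np.array([1.0, 2.0, 3.0, 4.0])
print(cohens_d(x1, x2))  # difference of means is 3, pooled s ~ 1.29, so d ~ 2.32
```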
Another way of looking at the effect size -- and frankly, one that I personally prefer -- is to ask what part (percentage) of the variability in the data can be explained by the estimated effect. You can estimate the variance between and within the groups and see how they relate (this is actually what ANOVA does, and the t-test is in principle a special case of ANOVA). This is the reasoning behind the coefficient of determination, $r^2$, and the related $\eta^2$ and $\omega^2$ statistics. Now, in a t-test, $\eta^2$ can easily be calculated from the $t$ statistic itself:
$$\eta^2 = \frac{ t^2}{t^2 + n_1 + n_2 - 2 }$$
This value can be directly interpreted as "fraction of variance in the data which is explained by the difference between the groups". There are different rules of thumb to say what is a "large" and what is a "small" effect, but it all depends on your particular question. 1% of the variance explained can be laughable, or can be just enough.
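As a sketch, the formula can be applied directly to the output of scipy's two-sample t-test (synthetic data; the `eta_squared` helper is my own naming):

```python
import numpy as np
from scipy import stats

def eta_squared(t, n1, n2):
    # eta^2 = t^2 / (t^2 + n1 + n2 - 2), per the formula above.
    return t**2 / (t**2 + n1 + n2 - 2)

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.8, 1.0, 50)

t, p = stats.ttest_ind(a, b)
print(eta_squared(t, len(a), len(b)))  # fraction of variance explained
```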
Best Answer
Nothing out of the ordinary is going on from the sound of it.
Have you looked at the range of possible values for the statistic?
For the usual form of the U-statistic, it can take values between $0$ and $mn$ where $m$ and $n$ are the two sample sizes.
If you divide the statistic by $mn$, you get $\frac{U}{mn}$, which is the proportion (rather than the count) of cases in which a value from one sample exceeds a value from the other, which takes values between $0$ and $1$. The null case corresponds to an expected proportion of $\frac12$ (with standard error $\sqrt{\frac{m+n+1}{12mn}}$).
Alternatively, you could look at a z-score, which you may find somewhat more intuitive than the raw test statistic.
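A minimal sketch of both rescalings, assuming scipy >= 1.7 (where `mannwhitneyu` reports the U statistic for the first sample) and ignoring any tie correction in the normal approximation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 25)   # m = 25
y = rng.normal(0.7, 1.0, 30)   # n = 30

u, _ = stats.mannwhitneyu(x, y, alternative="two-sided")
m, n = len(x), len(y)

# Proportion of (x, y) pairs where x exceeds y: in [0, 1], 1/2 under the null.
prop = u / (m * n)

# z-score: U has null mean m*n/2 and SD sqrt(m*n*(m+n+1)/12).
z = (u - m * n / 2) / np.sqrt(m * n * (m + n + 1) / 12)

print(prop, z)
```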
Certainly. Unless your sample sizes are very small, extremely small p-values are possible.
For a one-tailed test the p-value may be as small as $\frac{m!\, n!}{(m+n)!}$ and twice that for a two-tailed test. For example with small sample sizes of $m=n=10$, you could see a two-tailed p-value as small as $1/92378$ or about $0.000011$ and the smallest available p-values decrease very rapidly as sample sizes increase. Doubling both sample sizes to $m=n=20$ reduces the smallest possible p-value by a factor of about $746000$, to $1.45\times 10^{-11}$.
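These minimum p-values are easy to check, since $\frac{m!\,n!}{(m+n)!} = 1/\binom{m+n}{m}$:

```python
from math import comb

def min_two_tailed_p(m, n):
    # Smallest attainable two-tailed p-value: 2 * m! n! / (m+n)!
    return 2 / comb(m + n, m)

p10 = min_two_tailed_p(10, 10)
p20 = min_two_tailed_p(20, 20)
print(p10)        # 1/92378, about 0.000011
print(p20)        # about 1.45e-11
print(p10 / p20)  # reduction factor, about 746000
```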
A small to moderate effect size with large samples or a large effect size with smaller samples can both do it.