I counted the number of geese on an intertidal mudflat on 100+ days over the winter. I made two counts on each of these days: one at low tide and one at high tide. I want to know if the number of geese present differs at high and low tide. As the data are very positively skewed, using a Wilcoxon test seems appropriate. However, should I use the rank-sum test (an unpaired test) or the signed-rank test (a paired test)?
Solved – Paired or unpaired Wilcoxon test
paired-data, wilcoxon-mann-whitney-test, wilcoxon-signed-rank
Related Solutions
Yes, there is. For example, any sampling from distributions with infinite variance will wreck the t-test, but not the Wilcoxon. Referring to Nonparametric Statistical Methods (Hollander and Wolfe), I see that the asymptotic relative efficiency (ARE) of the Wilcoxon relative to the t test is 1.0 for the Uniform distribution, 1.097 (i.e., Wilcoxon is better) for the Logistic, 1.5 for the double Exponential (Laplace), and 3.0 for the Exponential.
Hodges and Lehmann showed that the minimum ARE of the Wilcoxon relative to any other test is 0.864, so you can never lose more than about 14% efficiency using it relative to anything else. (Of course, this is an asymptotic result.) Consequently, Frank Harrell's use of the Wilcoxon as a default should probably be adopted by almost everyone, including myself.
Edit: Responding to the followup question in comments, for those who prefer confidence intervals, the Hodges-Lehmann estimator is the estimator that "corresponds" to the Wilcoxon test, and confidence intervals can be constructed around that.
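In case a concrete computation helps: the one-sample Hodges-Lehmann estimator is the median of the Walsh averages of the paired differences. A minimal sketch (the function name and the toy data are mine, not from any library):

```python
import numpy as np

def hodges_lehmann(diffs):
    """One-sample Hodges-Lehmann estimate: the median of all
    Walsh averages (d_i + d_j) / 2 over pairs with i <= j."""
    d = np.asarray(diffs, dtype=float)
    i, j = np.triu_indices(len(d))      # all index pairs with i <= j
    walsh = (d[i] + d[j]) / 2.0
    return np.median(walsh)

# toy paired differences (invented numbers)
print(hodges_lehmann([1.0, 2.0, 3.0]))  # 2.0
```

A confidence interval can then be read off the ordered Walsh averages; `scipy.stats.wilcoxon` does not return this directly, so the sketch above builds the point estimate by hand.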
You should use the signed rank test when the data are paired.
You'll find many definitions of pairing, but at heart the criterion is something that makes pairs of values at least somewhat positively dependent, while unpaired values are not dependent. Often the dependence arises because the paired values are observations on the same unit (repeated measures), but they don't have to be on the same unit; values that merely tend to be associated in some way (while measuring the same kind of thing) can still be considered 'paired'.
You should use the rank-sum test when the data are not paired.
That's basically all there is to it.
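In SciPy terms, the two choices are `wilcoxon` (signed-rank, paired) and `mannwhitneyu` (rank-sum, unpaired). A sketch with invented count data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# hypothetical paired counts: low-tide and high-tide counts on the same days
low = rng.poisson(50, size=30)
high = rng.poisson(60, size=30)

# paired: Wilcoxon signed-rank test on the per-day differences
stat_p, p_paired = stats.wilcoxon(high, low)

# unpaired: Wilcoxon rank-sum / Mann-Whitney U test, ignoring the pairing
stat_u, p_unpaired = stats.mannwhitneyu(high, low, alternative='two-sided')

print(p_paired, p_unpaired)
```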
Note that having the same $n$ doesn't mean the data are paired, and having different $n$ doesn't mean that there isn't pairing (it may be that a few pairs lost an observation for some reason). Pairing comes from consideration of what was sampled.
The effect of using a paired test when the data are paired is that it generally gives more power to detect the changes you're interested in. If the association leads to strong dependence*, then the gain in power may be substantial.
* specifically, but speaking somewhat loosely, if the effect size is large compared to the typical size of the pair-differences, but small compared to the typical size of the unpaired-differences, you may pick up the difference with a paired test at a quite small sample size but with an unpaired test only at a much larger sample size.
However, when the data are not paired, it may be (at least slightly) counterproductive to treat them as paired. That said, the cost in lost power may in many circumstances be quite small; a power study I did in response to this question suggests that in typical small-sample situations (say, $n$ on the order of 10 to 30 in each sample, after adjusting for differences in significance level) the average power loss may be surprisingly small.
[If you're somehow really uncertain whether the data are paired or not, the loss from treating unpaired data as paired is usually relatively minor, while the gains may be substantial if they are paired. So if you really don't know, and have a way of figuring out what would be paired with what assuming they were paired (such as the values being in the same row in a table), it may in practice make sense to act as if the data were paired, to be safe -- though some people may get quite exercised over your doing that.]
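The power gain from pairing is easy to see in a small simulation. This is only an illustrative sketch: the shared "day" effect, sample size, and shift are all invented to induce the kind of dependence discussed above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, shift = 20, 1000, 0.5

hits_paired = hits_unpaired = 0
for _ in range(reps):
    day = rng.normal(0, 2, n)          # shared day effect: the source of pairing
    x = day + rng.normal(0, 1, n)
    y = day + shift + rng.normal(0, 1, n)
    # paired test: the day effect cancels in the differences
    if stats.wilcoxon(y, x).pvalue < 0.05:
        hits_paired += 1
    # unpaired test: the day effect inflates the within-group spread
    if stats.mannwhitneyu(y, x, alternative='two-sided').pvalue < 0.05:
        hits_unpaired += 1

print(hits_paired / reps, hits_unpaired / reps)
```

With strong shared day-to-day variation, the paired rejection rate should come out well above the unpaired one.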
Best Answer
The days would appear to be the obvious pairing factor, suggesting a paired test.
Specifically, since you'd generally expect that the high tide/low tide geese count for a given day will tend to be more similar than high tide and low tide geese from two randomly selected days, the data are paired. The typical numbers of geese will tend to go up and down over time (as flocks come in or move on), which leads to that sort of dependence.
Pair-differencing may not eliminate all of the inter-day correlation (you should probably check for that via some diagnostic, say a plot of each day's high-minus-low-tide difference against the previous day's), but it will probably eliminate the major part of it.
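One simple numerical version of that diagnostic is the lag-1 correlation of the daily differences; the numbers below are invented purely for illustration:

```python
import numpy as np

# hypothetical per-day differences: high-tide count minus low-tide count
diff = np.array([5, -2, 3, 7, 0, -4, 6, 2, -1, 3], dtype=float)

# lag-1 check: correlate each day's difference with the previous day's
r = np.corrcoef(diff[:-1], diff[1:])[0, 1]
print(r)  # a value near 0 suggests differencing removed most of the dependence
```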
Another issue of concern to me is the fact that you have counts. This introduces several features that tend to suggest that neither signed rank nor t-tests are fully suitable:
(i) the discreteness. The signed rank test relies on a continuous distribution of differences. This might be dealt with by simulating the null distribution, but is complicated by (ii)*;
(ii) the variance tends to be related to the mean. This heteroskedasticity tends to invalidate the t-test, pushing the distribution of the statistic toward heavier tails in a way that's hard to quantify (since the result will be a scale mixture over an unknown mixing distribution).
* one possibility would be to do some simulations to quantify the likely impact on the null distribution simulated under one set of count assumptions (say close to the average counts) by trying a few plausible scenarios of varying counts. The actual impact may be quite small.
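One way to simulate that null distribution: under the null, each day's difference is equally likely to be positive or negative, so randomly flipping the signs of the observed differences gives a reference distribution that respects their exact discreteness and tie pattern. A sketch with invented counts (the statistic here is the sum of positive-difference ranks, computed by hand to keep the convention explicit):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# hypothetical paired counts on 15 days, no true tide effect
low = rng.poisson(40, 15)
high = rng.poisson(40, 15)
d = (high - low).astype(float)

def w_plus(d):
    """Sum of ranks of positive differences (zeros dropped, ties mid-ranked)."""
    d = d[d != 0]
    r = stats.rankdata(np.abs(d))
    return r[d > 0].sum()

obs = w_plus(d)

# null distribution by random sign flips; zeros stay zero, ties are preserved
sim = np.array([w_plus(d * rng.choice([-1.0, 1.0], size=len(d)))
                for _ in range(5000)])
p = np.mean(np.abs(sim - sim.mean()) >= np.abs(obs - sim.mean()))
print(p)
```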
You have counts; you might be better off working with them as counts. There are a number of ways this might proceed (it's possible to construct chi-squared tests, or you might fit some GLM with days as a blocking factor). If you want to treat the days as a random effect, it would be a mixed-effects GLM (a GLMM).