Wilcoxon-Mann-Whitney Test – Determining Sample Size for Statistical Power

sample-size, self-study, statistical-power, wilcoxon-mann-whitney-test

You serve two different versions of a website to customers, with the aim of seeing which, if either, is better overall. The two versions are alternated by the order in which requests arrive at the web server: one user gets the first version, the next new user (a different IP address) gets the alternate version, and so on.

Because the users are not matched, and it is unclear whether normality holds, the Wilcoxon–Mann–Whitney test seems a good choice. For customers who bought something, a test could be done with purchase value in dollars as the random variable. For customers who do not purchase, a test could be done with time spent on the website (the delta between first and last request) as the random variable. Both tests would be one-sided, since one of the website versions (the newer one) is designed to be better and the tests are meant to measure whether this is the case.

How would the required sample size be calculated? Wikipedia [https://en.wikipedia.org/wiki/Mann–Whitney_U_test] says "[i]t is a widely recommended practice for scientists to report an effect size for an inferential test", but calculating a sample size requires an effect size as an input. Is the required sample size independent of which test is used, or is the test itself an input to the sample size calculation? How is an effect size specified a priori? (It is problem dependent, but a simple example would help.) And since there are two tests, does that affect the sample size calculation?

A simulation-based answer might be a good way to illustrate the issues here.

Best Answer

Very roughly, suppose $\sigma/\Delta = 5$ (where $\sigma$ is the population standard deviation and $\Delta$ is the effect size; say $\sigma = 10,\ \Delta = 2$) and the desired power is 95%. Then a pooled two-sample t test on normal data requires about $n \approx 650$ in each group (version), according to an on-line power calculator for pooled two-sample t tests.
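
As a cross-check (not the calculator referred to above), R's built-in power.t.test gives essentially the same per-group sample size for these inputs:

power.t.test(delta = 2, sd = 10, sig.level = 0.05, power = 0.95,
             type = "two.sample", alternative = "two.sided")
# n comes out at about 650 per group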

One example: $n=650, \mu_0 = 80, \mu_a = 82, \sigma=10.$

set.seed(1234) # for reproducibility
x1 = rnorm(650, 80, 10)
x2 = rnorm(650, 82, 10)

A pooled t test finds a significant difference for these fictitious data with P-value about $0.0002 < 0.05 = 5\%.$

t.test(x1, x2, var.eq=T)

      Two Sample t-test

data:  x1 and x2
t = -3.7516, df = 1298, p-value = 0.0001834
alternative hypothesis: 
 true difference in means is not equal to 0
95 percent confidence interval:
 -3.1259132 -0.9792445
sample estimates:
mean of x mean of y 
 79.80134  81.85392 

So this seems to be going in the right direction. But is this just a "lucky" sample? How often do we reject, if we do such a test $10^5$ times?

set.seed(2021)
pv = replicate(10^5, t.test(rnorm(650,80,10),                               
               rnorm(650,82,10), var.eq=T)$p.val)
mean(pv <= 0.05)
[1] 0.9507

The power is very nearly the 95% 'promised' by the on-line calculator.

A two-sample Wilcoxon rank sum test gives about the same result when used with normal data; the power is about 94%. [I used fewer iterations because the program runs slowly.]

set.seed(2021)
pv = replicate(10^4, wilcox.test(rnorm(650,80,10), 
                         rnorm(650,82,10))$p.val)
mean(pv <= 0.05)
[1] 0.9411

One run with non-normal data gives about 87% power. The exponential populations involved in this simulation have roughly the same shift and standard deviations as the normal distributions used in the on-line calculator, but they are highly right skewed. Results will vary depending on the shapes of the distributions involved. (Often 80% power is considered good enough. Additional simulation runs show that $n=700$ gives power near 90%; $n=800,$ about 95%.)

set.seed(1011)
pv = replicate(10^4, wilcox.test(rexp(650,1/9), 
                           rexp(650,1/11))$p.val)
mean(pv <= 0.05)
[1] 0.8735
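
For reference, the additional runs at larger sample sizes mentioned above follow the same pattern; here is a minimal sketch (fewer iterations to keep the run time modest, so the estimates will wobble a bit from run to run):

set.seed(1011)
for (n in c(700, 800)) {
  pv = replicate(5000, wilcox.test(rexp(n, 1/9),
                                   rexp(n, 1/11))$p.val)
  cat("n =", n, " estimated power:", mean(pv <= 0.05), "\n")
}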

[Note: If the data are known to be exponential, then there is a better test for different means than the nonparametric 2-sample Wilcoxon test.]
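
For example (a sketch, assuming the data really are exponential; the helper name exp.ratio.test is made up here): if the two samples come from exponential distributions with equal means, the ratio of the sample means follows an F distribution with (2*n1, 2*n2) degrees of freedom, which gives an exact test whose power could be compared with the Wilcoxon test by the same kind of simulation as above.

# Exact test for equal exponential means (assumes exponential data).
# Under H0: equal means, mean(x)/mean(y) ~ F(2*length(x), 2*length(y)).
exp.ratio.test = function(x, y) {
  f   = mean(x) / mean(y)
  df1 = 2 * length(x); df2 = 2 * length(y)
  2 * min(pf(f, df1, df2), 1 - pf(f, df1, df2))  # two-sided p-value
}

set.seed(1011)
pv = replicate(10^4, exp.ratio.test(rexp(650, 1/9), rexp(650, 1/11)))
mean(pv <= 0.05)   # estimated power of this test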