Solved – Required number of permutations for a permutation-based p-value

hypothesis testingp-valuepermutation-testresampling

If I need to calculate a permutation-based $p$-value with significance level $\alpha$, how many permutations do I need ?

From the article "Permutation Tests for Studying Classifier Performance", page 5 :

In practice, the upper bound $1/(2\sqrt{k})$ is typically used to determine the number of
samples required to achieve the desired precision of the test.

… where $k$ is the number of permutations.

How do I calculate the number of required permutations from this formula ?

Best Answer

I admit, the paragraph might be confusing.

When performing a permutation test you do estimate a p-value. The issue is, that the estimation of the p-value has an error itself which is calculated as $\sqrt{\frac{p(1-p)}{k}}$. If the error is too large, the p-value is unreliable.

So how many permutations k does one need to get a reliable estimate ?

First define your maximum allowed error aka the precision. Let this be $P$. Then an estimated p-value shall be in the interval $[p-3*P,p+3*P]$ (since p is approximately normal distributed)

Using the upper bound

The cited paragraph of the paper suggests to use $\frac{1}{2\sqrt{k}}$ as an upper bound estimate of the error instead of $\sqrt{\frac{p(1-p)}{k}}$. This corresponds to a unknown p-value of p=0.5 (where the error is maximum among all ps for a fixed k).

So: You want to know k where $\frac{1}{2\sqrt{k}}\le P$.

<=> $\frac{1}{4P^2}\le k$

But since the cited formula represents an upper bound, this approach is very rough.

Using the error at the significance level

Another approach uses the desired significance level $\alpha$ as p to calculate the required precision. This is correct, because the error of the estimated p is more important if we are near the decision threshold (which is the significance level).

In this case one wants to know k where $\sqrt{\frac{\alpha(1-\alpha)}{k}}\le P$.

<=> $\frac{(\alpha(1-\alpha))}{P^2}\le k$

Note that if the true unkown p-value is clearly bigger than $\alpha$, then the error is actually bigger, so p in $[p-3*P,p+3*P]$ does not hold anymore.

Extending the confidence interval

This approach corresponds with the center of the confidence interval being right at the decision threshold. In order to force the upper bound of the confidence interval of the estimated p being below the decision threshold (which is more correct), one needs ...

$l\sqrt{\frac{\alpha(1-\alpha)}{k}}\le P$

<=> $(l)^2\frac{(\alpha(1-\alpha))}{P^2}\le k$

where l corresponds to (see again the graphic)

| l | confidence interval |
| 1 | ~68 % |
| 2 | ~95 % |
| 3 | ~99 % |

Examples: Let the desired precison P be 0.005.

Then using the rough upper bound one gets $k>=10000$.

Using P at $\alpha=0.05$ and requesting a 95%-confidence interval one gets $k>=7600$.

For P=0.01 at $\alpha=0.01$ and a 95 % confidence interval one gets k>=396.

Finally: I strongly suggest to dive deeper into Monte-Carlo simulations. The wikipedia provides a start.

Related Solutions

Binomial Distribution – How to Determine Number of Trials Required to Achieve Desired Odds

I wonder what it means to be "99% sure."

The code seems to equate "dumb luck" with $p$ = 52.8% probability of wins. Let's imagine conducting $N$ trials, during which we observe $k$ wins. Suppose, for instance, $N$ = 1000 and you observe $k$ = 530 wins. That's greater than the expected number $p N$ = 528, but it's so close that a sceptic would argue it could be due to chance: "dumb luck." So, you need a test. The best test uses a threshold $t$ (which depends, obviously, on $N$ and $p$ and will be a bit greater than $N p$). The chance--still assuming $p$ = 52.8%--of having more than $t$ wins needs to be so low that if you do observe more than $t$ wins, both you and a reasonable sceptic will allow that $p$ is probably greater than 52.8%. This chance is usually called $\alpha$, the "false positive rate." Your first task is to set this threshold. How will you do it? That is, what value of $\alpha$ do you want to use? Does "99% sure" mean that $1 - \alpha$ should equal 99%?

The second task is to consider your chances of beating the threshold if in fact $p$ equals 55% (or whatever you choose). Does "99% sure" mean you want to have a 99% chance of having more than $t$ wins in $N$ trials when $p$ = 55%?

So, it seems to me there are two chances involved here: one in determining the test to distinguish "dumb luck" and another one that will subsequently determine how many trials you need. If this analysis correctly reflects the situation, the problem is easily solved. Note a few things:

There are plenty of uncertainties about this procedure and some of them are not captured by the statistical model. Therefore we shouldn't aim for too much precision. There's really no difference between, say, 2742 and 2749 trials.
In case it turns out hundreds or thousands of trials are needed, the (usual) Normal approximation to the Binomial distribution will be highly accurate. Point #1 justifies using an approximation.
Consequently, there is a simple formula to compute the number of trials directly: no need to explore hundreds of possibilities!

The formula alluded to in #3 is this: in $N$ trials with probability $p$ of winning, the standard error of the proportion of wins is

$$s = \sqrt{p(1-p)/N}.$$

The threshold therefore should be

$$t(N) = N(p + z_\alpha s)$$

where $z_\alpha$ is the upper $100 - 100\alpha$ percentile of the standard normal distribution. If the true probability of a win actually is $q$ (such as $q$ = 55%), the chance of exceeding $t(N)$ in $N$ trials equals

$$\Phi\left(\frac{(q-t(N)/N)\sqrt{N}}{\sqrt{q(1-q)}}\right) = \Phi\left(\frac{(q-p)\sqrt{N} - z_\alpha \sqrt{p(1-p)}}{\sqrt{q(1-q)}}\right) $$

where $\Phi$ denotes the standard normal cumulative distribution function. Equating this with your desired degree of certainty (such as $\beta$ = 99%) and solving this equation for $N$ (rounding up to the nearest integer) gives the solution

$$N = \left(\frac{z_\alpha \sqrt{p(1-p)} + z_\beta \sqrt{q(1-q)}}{q-p}\right)^2$$

(assuming $q \gt p$).

Example

For $p$ = 52.8%, $q$ = 55%, $\alpha$ = 1%, and $\beta$ = 99%, which give $z_\alpha$ = $z_\beta$ = 2.32635, I obtain $N$ = 11109 with this formula. In 11109 trials you would expect to win about 5866 times by "dumb luck" and maybe as many as $t$ = 5988 times due to chance variation. If the true winning percentage is 55%, you expect to win about 6110 times, but there's a 1% chance you will win 5988 or fewer times, and thereby (due to "bad luck") not succeed in showing (with 99% confidence) that "dumb luck" isn't operating.

R code for the example

#
# Specify inputs.
#
p <- 52.8/100     # Null hypothesis
q <- 55/100       # Alternative
alpha <- 1/100    # Test size
beta <- 99/100    # Chance of confirming the alternative
#
# Compute number of trials.
#
s.p <- sqrt(p*(1-p))
s.q <- sqrt(q*(1-q))
z.alpha <- qnorm(1 - alpha)
z.beta <- qnorm(beta)
delta <- q - p
N <- ceiling(((z.alpha * s.p + z.beta * s.q)/delta)^2)
#
# Explain.
#
t <- floor(N*(p + z.alpha*se.p))   # Threshold
m <- round(N*p)                    # Expectation (null)
k <- round(N*q)                    # Expectation (alternate)
size <- pnorm((t + 1/2 - N*p)/(s.p * sqrt(N))) # Actual test size
sureness <- 1 - pnorm((N*q - (t + 1/2))/(s.q * sqrt(N)))
print(list(N=N, threshold=t, expectation=m, alternate=k,
  size=size, sureness=sureness))

Permutation Test – Randomisation/Permutation Test for Paired Vectors in R

Though I pointed in comments to the use of the coin package I think it's worth illustrating that a permutation/randomization test is really quite simple, so I have done it.

Here I write some R code to do a randomization test for a one sample test of location. The test randomly flips signs on the differences and computes the mean; this is equivalent to randomly assigning each pair of values to the x and y groups. The code below could be made significantly shorter (I could do it in two lines easily enough, or even one if you didn't mind slower code).

This code takes a few seconds on my machine:

# assumes the two samples are in 'x' and 'y' and x[i] and y[i] are paired
# set up:
B <- 99999
d <- x-y
m0 <- mean(d)

# perform a one-sample randomization test on d
# for the null hypothesis H0: mu_d = 0   vs H1 mu_d != 0  (i.e. two tailed)
# here the test statistic is the mean
rndmdist <- replicate(B,mean((rbinom(length(d),1,.5)*2-1)*d))

# two tailed p-value:
sum( abs(rndmdist) >= abs(m0))/length(rndmdist)

That's the whole thing.

Note that rbinom(length(d),1,.5)*2-1) gives a random -1 or 1 ... i.e. a random sign, so when we multiply by any set of signed d, it is equivalent to randomly assigning + or - signs to the absolute differences. [It doesn't matter what distribution of signs on d you start with, now the d will have random signs.]

Here, I compare it with a t-test on some made up data:

 set.seed(seed=438978)
 z=rnorm(50,10,2)
 x=z-rnorm(50,0,.5)
 y=z+.4+rnorm(50,0,.5)
 t.test(y-x) # gives p = 0.003156

 B <- 99999
 d <- x-y
 m0 <- mean(d)
 rndmdist <- replicate(B,mean((rbinom(length(d),1,.5)*2-1)*d))
 sum( abs(rndmdist) >= abs(m0))/length(rndmdist)

When the t-test is valid it usually gives a very similar p-value to the completely enumerated permutation test, and a simulated p-value as above (when the number of simulations is sufficiently large) will converge to that second p-value.

At the number of replications used above, a true permutation p-value (i.e. from complete enumeration) of 0.05 will be estimated to within 0.001 (that is, will give a randomization p-value between 0.049 and 0.051) about 85% of the time and to within 0.002 over 99.5% of the time.