Solved – How is the exact permutation test procedure carried out: iterating over permutations or using combinations of one group

combinatoricshypothesis testingp-valuepermutation-test

I've tried to find an article that explains the procedure of permutation tests for the exhaustive sampling of all permutations (not the monte carlo method) and couldn't find a resource that was specific enough to help me with the ambiguity outlined below. For example, the Wikipedia article (https://en.wikipedia.org/wiki/Resampling_(statistics)#Permutation_tests) says

enter image description here

For example, given a combined dataset (1, 2, 3) where group A has length 2 and group B has length 1 for simplicity, it is not clear to me whether "all possible ways to divide it" means {(1, 2), (3)} and {(2, 1), (3), …} or if they count "{(1, 2), (3)}" and "{(2, 1), (3)}" as the same division.

Looking at various code examples, for example, the Python, R, Julia, etc. examples on https://rosettacode.org/wiki/Permutation_test, I see that the permutation test is usually implemented as follows:


Given two samples A and B

  1. Record the test statistic (e.g., $|\bar{A} – \bar{B}|$)
  2. Combine samples A and B into one large sample AB
  3. for combination of length(A) from AB:
    3a) compute permutation statistic (e.g., $|\bar{A'} – \bar{B'}|$, where A' is the combination from 3. and B' are all the samples in AB that are not in A').
    3b) record permutation statistic
  4. Compute p-value as the proportion the permutation statistic from 3a) was more extreme than the test statistic from 1. divided by the number of combinations sampled

However, shouldn't we be sampling the permutations instead of combinations of length A? For example, as outlined below (I highlighted the difference to the previous procedure in bold):


Given two samples A and B

  1. Record the test statistic (e.g., $|\bar{A} – \bar{B}|$)
  2. Combine samples A and B into one large sample AB
  3. for permutation of length(AB) from AB:
    3a) compute permutation statistic (e.g., $|\bar{A'} – \bar{B'}|$, where A' are the first len(A) samples in the permuted AB sequence and B' are the remaining samples in AB

  4. Compute p-value as the proportion the permutation statistic from 3a) was more extreme than the test statistic from 1. divided by the number of permutations sampled


Or, to provide a simple numeric example, consider the following 2 samples

a = [1, 3]
b = [2]

with the observed difference:
obs = |mean(a) – mean(b)| = 2

Using the "combinations" procedure, we would be sampling the following:

(1, 2), (3) => diff 0
(1, 3), (2) => diff 2
(2, 3), (1) => diff 4

where in 2 out of 3 cases, we would observe a difference equal or more extreme than in the observed statistic (i.e., p=2/3)

Now, using permutations, we would get the following:

(1, 2), (3) => diff 0
(1, 3), (2) => diff 2
(2, 1), (3) => diff 0
(2, 3), (1) => diff 4
(3, 1), (2) => diff 2
(3, 2), (1) => diff 4

Here, we observe a difference that is equal or more extreme than the observed statistic in 4 out of 6 cases (p=4/6)

Does anyone no more about the exact procedure and has a reliable resource at hand? Thanks!

Best Answer

Whether or not you take the combinations or permutations doesn't actually affect your results, as the number of permutations of $n_{A}$ specific objects in $A$ and $n_{B}$ specific objects in $B$ is the same for all combinations of $x_{1} ... x_{n_{A}}$ and $x_{n_{A+1}} ... x_{n_{A} + n_{B}}$ since the size of each set doesn't change.

That is, for each and any given combination, you will get $n_{A}! \times n_{B}!$ times as many permutations than combinations regardless of the values inside each set. And as the value of the result (the difference between group means) does not change between permutations of the same combination the frequency of each specific result will be scaled equally when taking the permutation. So when calculating the quantiles practically it makes no difference using combinations or permutations. In fact you empirically proved it for the case of $n_{A} = 1$ and $n_{b} = 2$, the frequency of each result, $D = 0,2,4$, is just scaled by $2$ when taking permutations resulting in the quantile values being the same.

Let's assume the standard scenario where samples are independent, and we want to test if two samples come from the same distribution (null hypothesis) based on the difference in sample means

To be technical if you want to test this specific hypothesis I think it is more strictly "correct" to take the complete set of permutations (not combinations) of each set, as the distribution assumption under the null that group labels don't matter, is essentially allowing each $x_{i.}$ to take every value in the presence of every other $x_{j \neq i.}$, which combinations do not allow for.

But again, the results of the quantiles for the empirical distribution are the same since the frequency of each result is just scaled by the same amount $n_{A}! \times n_{B}!$, so practically it doesn't matter.

Related Question