Permutation Test – Randomisation/Permutation Test for Paired Vectors in R

permutation-testrstatistical significance

I'm not an expert, so forgive me if some of the terminology is a little clumsy. Happy to provide more information where required.

I have two vectors of 50 paired numeric values in R. I want to perform a two-tailed randomisation or permutation test to determine whether their differences are due to chance or not.

A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points.

I want to do this type of test because I believe the distributions of the values in the vectors violate the assumptions of other tests such as the t-test (for example, many of the numeric values in the vector are 0).

The permtest function in the BHH2 library, almost does what I want, but it operates on all $2^{50}$ permutations, which will take too long. Instead, I want to estimate the p-value, by sampling a large number of the possible permutations. I had a look in the coin package, but nothing in there seems to do a permutation test with sampling from paired numeric vectors.

Some googling lead me to this email, which suggests that the reason I can't find a package to do it is that it's a one-liner in R. Unfortunately, I'm not experienced enough with R to be able to produce that one-liner.

Is there a package or method that will perform a two-tailed paired permutation test using only a sample of the permutation space?

If not, would somebody be able to share a short bit of R code to do it?

Best Answer

Though I pointed in comments to the use of the coin package I think it's worth illustrating that a permutation/randomization test is really quite simple, so I have done it.

Here I write some R code to do a randomization test for a one sample test of location. The test randomly flips signs on the differences and computes the mean; this is equivalent to randomly assigning each pair of values to the x and y groups. The code below could be made significantly shorter (I could do it in two lines easily enough, or even one if you didn't mind slower code).

This code takes a few seconds on my machine:

# assumes the two samples are in 'x' and 'y' and x[i] and y[i] are paired
# set up:
B <- 99999
d <- x-y
m0 <- mean(d)

# perform a one-sample randomization test on d
# for the null hypothesis H0: mu_d = 0   vs H1 mu_d != 0  (i.e. two tailed)
# here the test statistic is the mean
rndmdist <- replicate(B,mean((rbinom(length(d),1,.5)*2-1)*d))

# two tailed p-value:
sum( abs(rndmdist) >= abs(m0))/length(rndmdist)

That's the whole thing.

Note that rbinom(length(d),1,.5)*2-1) gives a random -1 or 1 ... i.e. a random sign, so when we multiply by any set of signed d, it is equivalent to randomly assigning + or - signs to the absolute differences. [It doesn't matter what distribution of signs on d you start with, now the d will have random signs.]

Here, I compare it with a t-test on some made up data:

 set.seed(seed=438978)
 z=rnorm(50,10,2)
 x=z-rnorm(50,0,.5)
 y=z+.4+rnorm(50,0,.5)
 t.test(y-x) # gives p = 0.003156

 B <- 99999
 d <- x-y
 m0 <- mean(d)
 rndmdist <- replicate(B,mean((rbinom(length(d),1,.5)*2-1)*d))
 sum( abs(rndmdist) >= abs(m0))/length(rndmdist)

When the t-test is valid it usually gives a very similar p-value to the completely enumerated permutation test, and a simulated p-value as above (when the number of simulations is sufficiently large) will converge to that second p-value.

At the number of replications used above, a true permutation p-value (i.e. from complete enumeration) of 0.05 will be estimated to within 0.001 (that is, will give a randomization p-value between 0.049 and 0.051) about 85% of the time and to within 0.002 over 99.5% of the time.

Related Solutions

Permutation Test – Assistance with One-Sample Permutation Z-Test

I don't think it is going a one sample Z; to me it looks like it is a test against a certain set of priors.

I'm confused, why are you doing a one sample Z using binomial data as your source data? You could simply create a distribution of N successes and see what quantile your actual data was in. However the above method doesn't look like a permutation test per-se to me; as your code doesn't actually permute the observed values between 1 and 0.

That being said, let me comment on your code - to me it looks as though z is defined as

zeta = (mean(mn) - mu) / (sqrt(var/nx))
z[i] <- zeta

Thus, each score in Z is like a Z score of the randomly created binomial vector using the priors you've selected as the null hypothesis. But then you compare that Z to abs(mx); where mx is defined as the mean of your observed binomial vector. At the very least this looks like a problem to me. Z scores should be either compared to some other Z score or means should be compared to means.

As I alluded to above, it is odd that you'd put all of this under a structure of a Z-test. The Z score is nominally a linear transform of the differences between means, as such the result of a test like this should be the same whether you use a Z score or simply look at the differences between means.

Moreover, what you are doing seems a like an attempt to test the observed value against some priors rather than an actual one sample permutation test. What you want to test against is something like permbinom (code provided below) where it could be the case for each observed value that it either was a success or it was not a success. This is in-line with Tukey's classic example with the lady who claimed she could tell whether tea or milk was added first. Critically different from your test, the assumption of this permutation test is that the null hypothesis is fixed at p = .5.

permbinom <- function(x)
{
  newx <- x
    nx <- length(x) 
    change <- rbinom(n=nx,size=1,prob=.5) 
    #This code is readable but inefficient
    #Swap the values between 1 and 0 if change == 1
    for (i in 1:nx)
    {
        if ((change[i] == 1) & (x[i] == 1)) {newx[i] <- 0} 
        if ((change[i] == 1) & (x[i] == 0)) {newx[i] <- 1} 
    }
  return(newx)
}
permtest <- function(x,nsim=2000)
{
   permref <- rep(NA,nsim)
   obsn <- sum(x)
   for (i in 1:nsim)
     {
      permref[i] <- sum(permbinom(x))
     }
     pval <- min(c(
              (sum(permref > obsn)*2/nsim),
              (sum(permref < obsn)*2/nsim)
                ))
  return(pval)
}

I'm not 100% confident regarding how I'm calculating the p-value here; so if someone would kindly correct me if I'm doing it wrong I'll incorporate that as an edit.

For reference, here is a faster permutation function for one-sample tests of binomial data.

permbinomf <- function(x)
{
  newx <- x
    nx <- length(x) 
    change <- rbinom(n=nx,size=1,prob=.5) 
    #This code is readable but inefficient
    #Swap the values between 1 and 0 if change == 1
newx <- x + change
newx <- ifelse(newx==2,0,newx)
  return(newx)
}

Edit: The question is also put forth, "What about an arbitrary subset with a one-sample z-test?". That would also work, assuming you had a large enough sample to subset. However, it would not be a permutation test, it would be more akin to a bootstrap.

Edit 2: Perhaps the most important answer to your question, is this: You are doing something acceptable (if you fix the Z vs mean computational error noted above), but you aren't doing what you think you are doing. You are comparing your results to results where the null hypothesis is true. This is essentially a Monte-Carlo simulation and if you correct the math (and I suggest you also simplify it) it is an acceptable technique for testing your hypothesis. Also note, my answer above is for a two-tailed test. As noted in the other question, you are ignoring the nesting of binomial observations under participants but independence isn't an assumption in a permutation or monte-carlo test so you should be more or less fine. Though, as also noted there you ignore the possibility that some people are doing better than chance and others are performing at chance.

Solved – Permutation test for multiple correlation test statistics

Why not simulate some data that is the same structure as your data, but without any correlation/relationships, then use that (probably do this multiple times) to see how your strategy behaves. If you use the permutation test for each of the 93 variables then you will still have multiple comparison issues and are still likely to declare 4-5 correlations as significant when they really are not due to chance.

To correct for multiple comparisons you would need to do something more along the lines of combining all your correlation measures (probably transformed to be on some similar scale) and comparing the combined measure to the permutation values. Combinations to consider would be the maximum correlation (absolute value) or the mean correlation.

Something more along the lines of the FDR would be to compare your strongest correlation to the strongest from the permuted distribution, then compare the 2nd strongest to the 2nd strongest from the permutations, etc.

Having a mixture of different correlation measures will complicate this, but you could either analyze the groups separately, or convert everything to be on a similar scale (p-value would be one) so that they are comparable.

Best Answer

Related Solutions

Permutation Test – Assistance with One-Sample Permutation Z-Test

Solved – Permutation test for multiple correlation test statistics

Related Question