Though I pointed in comments to the use of the coin package, I think it's worth illustrating that a permutation/randomization test is really quite simple, so I have done it.
Here I write some R code to do a randomization test for a one-sample test of location. The test randomly flips the signs on the differences and computes the mean; this is equivalent to randomly assigning each pair of values to the x and y groups. The code below could be made significantly shorter (two lines would be easy enough, or even one if you didn't mind slower code).
This code takes a few seconds on my machine:
# assumes the two samples are in 'x' and 'y' and x[i] and y[i] are paired
# set up:
B <- 99999
d <- x-y
m0 <- mean(d)
# perform a one-sample randomization test on d
# for the null hypothesis H0: mu_d = 0 vs H1 mu_d != 0 (i.e. two tailed)
# here the test statistic is the mean
rndmdist <- replicate(B,mean((rbinom(length(d),1,.5)*2-1)*d))
# two tailed p-value:
sum(abs(rndmdist) >= abs(m0)) / length(rndmdist)
That's the whole thing.
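As an illustration of the "could be made shorter" remark above (my compression, with made-up data so the snippet stands alone — any paired samples `x` and `y` would do):

```r
# made-up example data (illustrative assumption, not from the original)
set.seed(42)
x <- rnorm(20, 10); y <- rnorm(20, 10.5)
d <- x - y
# two-line version: resample random signs on d, then compare to the observed mean
rd <- replicate(99999, mean(sample(c(-1, 1), length(d), replace = TRUE) * d))
mean(abs(rd) >= abs(mean(d)))
```

Here `sample(c(-1, 1), ..., replace = TRUE)` plays the role of `rbinom(...)*2-1` in the longer version.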
Note that `rbinom(length(d),1,.5)*2-1` gives a random -1 or 1, i.e. a random sign, so when we multiply by any set of signed `d` it is equivalent to randomly assigning + or - signs to the absolute differences. [It doesn't matter what distribution of signs on `d` you start with; afterwards the `d` will have random signs.]
Here, I compare it with a t-test on some made-up data:
set.seed(seed=438978)
z=rnorm(50,10,2)
x=z-rnorm(50,0,.5)
y=z+.4+rnorm(50,0,.5)
t.test(y-x) # gives p = 0.003156
B <- 99999
d <- x-y
m0 <- mean(d)
rndmdist <- replicate(B,mean((rbinom(length(d),1,.5)*2-1)*d))
sum(abs(rndmdist) >= abs(m0)) / length(rndmdist)
When the t-test is valid it usually gives a very similar p-value to the completely enumerated permutation test, and a simulated p-value as above (when the number of simulations is sufficiently large) will converge to that second p-value.
At the number of replications used above, a true permutation p-value (i.e. from complete enumeration) of 0.05 will be estimated to within 0.001 (that is, will give a randomization p-value between 0.049 and 0.051) about 85% of the time and to within 0.002 over 99.5% of the time.
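Those percentages can be checked directly, since the count of resampled statistics at least as extreme as the observed one is binomial (this check is my addition, not part of the original answer):

```r
B <- 99999   # replications, as above
# exact probability that the estimated p-value lands within +/- eps
# of a true permutation p-value of 0.05
within <- function(eps, p = 0.05)
  pbinom(floor((p + eps) * B), B, p) - pbinom(ceiling((p - eps) * B) - 1, B, p)
within(0.001)   # about 0.85
within(0.002)   # over 0.995
```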
The formula uses standard set-builder notation. So $\{D' \in \hat{D}: e(f,D') \le e(f,D)\}$ is read as: the set containing all permutations $D'$ of $D$ in $\hat{D}$, restricted to those for which $e(f,D') \le e(f,D)$.
The pipes around the set definition ("$|$") denote the cardinality / size of the set.
The textual part above the formula contains the definition of $k$: $\hat{D}$ is the set of all created permutations, and $k$ is the size / cardinality of this set, so $k=|\hat{D}|$.
Example: Suppose you create 4 permutations $D'_1,D'_2,D'_3,D'_4$ of a dataset $D$ and calculate $e$ for every permutation. Let
- $\hat{D}=\{D'_1,D'_2,D'_3,D'_4\}$
- $e(f,D)=0.1$
- $e(f,D'_1)=e(f,D'_2)=e(f,D'_3)=0.2$
- $e(f,D'_4)=0.05$
Only $D'_4$ satisfies $e(f,D'_4) \le e(f,D)$, so $p = \frac{1+1}{4+1} = 0.4$.
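In R, the example works out as follows (a direct transcription of the numbers above):

```r
e_obs  <- 0.1                     # e(f, D)
e_perm <- c(0.2, 0.2, 0.2, 0.05)  # e(f, D'_1), ..., e(f, D'_4)
# only D'_4 satisfies e(f, D') <= e(f, D)
p <- (sum(e_perm <= e_obs) + 1) / (length(e_perm) + 1)
p  # 0.4
```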
Best Answer
Here's one way you could generate an interval from a resampling test, though it's not always appropriate to consider it a confidence interval$^\dagger$. For a specific example, take a test for a two-sample difference in means. Consider shifting the second sample by $\delta$ (which can be positive or negative). Then the set of $\delta$ values which would lead to non-rejection by the test at level $\alpha$ could be used as a nominally $1-\alpha$ confidence interval for the difference in means.
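A sketch of that inversion in R (the data, the grid of shifts, and the simple pooled-permutation p-value below are my illustrative choices, not from any particular source): shift the second sample by each candidate $\delta$, run the test, and keep the $\delta$ values that are not rejected.

```r
set.seed(1)
x <- rnorm(30, 10, 2)
y <- rnorm(30, 10.8, 2)

# two-sample permutation p-value for H0: equal means
perm_p <- function(x, y, B = 2000) {
  obs <- mean(x) - mean(y)
  pooled <- c(x, y)
  stats <- replicate(B, {
    i <- sample(length(pooled), length(x))
    mean(pooled[i]) - mean(pooled[-i])
  })
  mean(abs(stats) >= abs(obs))
}

# keep shifts delta for which x vs (y + delta) is not rejected at level 0.05;
# non-rejection happens when delta is close to mu_x - mu_y
deltas <- seq(-3, 1, by = 0.1)
keep <- sapply(deltas, function(d) perm_p(x, y + d) >= 0.05)
range(deltas[keep])   # endpoints of a nominal 95% interval
```

The grid resolution and the number of permutations both limit the precision of the endpoints, so in practice you'd refine the grid near the boundary.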
$\dagger$ Some authors (e.g. [1], p. 364 et seq., [2]) call an interval constructed this way (the set of parameter values not rejected by the test) a consonance interval. That's a better name than confidence interval here, because the approach doesn't necessarily give intervals with the desired coverage (though in many situations it's possible to see that it should), and the name conveys what the interval does tell you: an interval of values consistent with the data. Many people simply ignore the distinction; for example, I believe Cox & Hinkley call these confidence intervals.
Gelman includes discussion of why it can sometimes be problematic to universally consider them confidence intervals here.
It's not hard to explore the coverage under particular sets of assumptions (via simulation), though, and there's no lack of people calling bootstrap intervals "confidence intervals" (even when they are sometimes seen to have nothing like the claimed coverage).
More details on how to do it in the two sample difference-in-means case are discussed in [3], where they're called randomization confidence intervals and a claim is made there about when they're exact (which claim I haven't tried to evaluate).
The estimated p-value is a straight binomial proportion. So it has the same standard error as any other binomial proportion, $\sqrt{\frac{p(1-p)}{n}}$.
So if $p = 0.05$ and $n=1000$, the standard error of the observed proportion is about $0.0069$. A $90\%$ CI would be $\pm 1.13\%$ [Alternatively, $\pm 1\%$ is about $1.45$ standard errors each side, which would correspond to a confidence interval for the underlying p-value of a bit over $85\%$]
So, at least in a rough sense, you could talk about the uncertainty being "about 1%".
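The arithmetic above in R (just reproducing the numbers in the answer):

```r
p <- 0.05
n <- 1000
se <- sqrt(p * (1 - p) / n)  # about 0.0069
qnorm(0.95) * se             # 90% CI half-width, about 0.0113, i.e. +/- 1.13%
0.01 / se                    # +/- 1% is about 1.45 standard errors each side...
2 * pnorm(0.01 / se) - 1     # ...corresponding to a bit over 85% confidence
```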
--
[1] Kempthorne and Folks (1971), Probability, Statistics, and Data Analysis, Iowa State University Press.

[2] LaMotte, L.R. and Volaufová, J. (1999), "Prediction Intervals via Consonance Intervals", Journal of the Royal Statistical Society, Series D (The Statistician), Vol. 48, No. 3, pp. 419-424.

[3] Ernst, M.D. (2004), "Permutation Methods: A Basis for Exact Inference", Statistical Science, Vol. 19, No. 4, pp. 676-685.