It shouldn't matter much, since the test statistic will always be the difference in means (or something equivalent to it). Small differences can come from the implementation of the Monte Carlo methods. Trying the three packages with your data, with a one-sided test for two independent samples:
DV <- c(x1, y1)
IV <- factor(rep(c("A", "B"), c(length(x1), length(y1))))
library(coin) # for oneway_test(), pvalue()
pvalue(oneway_test(DV ~ IV, alternative="greater",
distribution=approximate(B=9999)))
[1] 0.00330033
library(perm) # for permTS()
permTS(DV ~ IV, alternative="greater", method="exact.mc",
control=permControl(nmc=10^4-1))$p.value
[1] 0.003
library(exactRankTests) # for perm.test()
perm.test(DV ~ IV, paired=FALSE, alternative="greater", exact=TRUE)$p.value
[1] 0.003171822
To check the exact p-value against a manual calculation over all permutations, I'll restrict the data to the first 9 values of each group.
x1 <- x1[1:9]
y1 <- y1[1:9]
DV <- c(x1, y1)
IV <- factor(rep(c("A", "B"), c(length(x1), length(y1))))
pvalue(oneway_test(DV ~ IV, alternative="greater", distribution="exact"))
[1] 0.0945907
permTS(DV ~ IV, alternative="greater", exact=TRUE)$p.value
[1] 0.0945907
# perm.test() gives different result due to rounding of input values
perm.test(DV ~ IV, paired=FALSE, alternative="greater", exact=TRUE)$p.value
[1] 0.1029412
# manual exact permutation test
idx <- seq_along(DV) # indices to permute
idxA <- combn(idx, length(x1)) # all possibilities for different groups
# function to calculate difference in group means given index vector for group A
getDiffM <- function(x) { mean(DV[x]) - mean(DV[!(idx %in% x)]) }
resDM <- apply(idxA, 2, getDiffM) # difference in means for all permutations
diffM <- mean(x1) - mean(y1) # observed difference in group means
# p-value: proportion of group means at least as extreme as observed one
(pVal <- sum(resDM >= diffM) / length(resDM))
[1] 0.0945907
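As an aside on "or something equivalent" above: with fixed group sizes, the sum of group A's values is a strictly increasing function of the difference in means, so both statistics order the permutations identically and give the same p-value. A self-contained sketch with made-up toy data (the values of x1 and y1 here are illustrative, not your data):

```r
# toy data, for illustration only
x1 <- c(5.2, 4.1, 6.3, 5.8, 4.9)
y1 <- c(3.9, 4.4, 3.1, 4.0, 3.5)
DV   <- c(x1, y1)
idx  <- seq_along(DV)
idxA <- combn(idx, length(x1))   # all assignments of indices to group A

# statistic 1: difference in group means
resDM <- apply(idxA, 2, function(x) mean(DV[x]) - mean(DV[-x]))
# statistic 2: sum of group-A values (equivalent for fixed group sizes)
resS  <- apply(idxA, 2, function(x) sum(DV[x]))

pDM <- mean(resDM >= (mean(x1) - mean(y1)))
pS  <- mean(resS  >= sum(x1))
pDM == pS   # TRUE: both statistics give the same permutation p-value
```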
coin and exactRankTests are both from the same author, but coin seems to be more general and extensive, also in terms of documentation. exactRankTests is no longer actively developed. I'd therefore choose coin (also because of informative functions like support()), unless you'd rather not deal with S4 objects.
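To illustrate what support() buys you: together with dperm(), it gives the full exact permutation distribution of the standardized statistic, from which you can recover the p-value by hand. A sketch, assuming the 9-value DV and IV from above are still defined:

```r
library(coin)
ot   <- oneway_test(DV ~ IV, alternative="greater", distribution="exact")
supp <- support(ot)       # all values the standardized statistic can take
dens <- dperm(ot, supp)   # exact permutation probabilities
sum(dens)                 # should be 1
# p-value recovered manually from the exact distribution:
sum(dens[supp >= statistic(ot)])
```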
EDIT: for two dependent samples (paired data), the syntax is
id <- factor(rep(1:length(x1), 2)) # factor for participant
pvalue(oneway_test(DV ~ IV | id, alternative="greater",
distribution=approximate(B=9999)))
[1] 0.00810081
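The blocked test above conditions on participants; the corresponding manual version permutes within pairs, i.e. flips the sign of each within-pair difference. A self-contained sketch with made-up paired data:

```r
# toy paired data, for illustration only
x1 <- c(5.2, 4.1, 6.3, 5.8, 4.9)
y1 <- c(3.9, 4.4, 3.1, 4.0, 3.5)
d  <- x1 - y1                        # within-pair differences
n  <- length(d)
# all 2^n possible sign assignments
signs <- as.matrix(expand.grid(rep(list(c(1, -1)), n)))
resM  <- signs %*% d / n             # mean difference under each sign flip
# one-sided exact p-value: proportion at least as extreme as observed
mean(resM >= mean(d))
```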
Your call for this test:
wilcox_test(y ~ condition | participant, data = data, alternative = "greater",
distribution = "approximate")
produces not a signed rank test (as the other calls do) but a rank sum test. That's a different test for a different situation (independent samples, not paired data). The output tells you so:
## Approximative Wilcoxon Mann-Whitney Rank Sum Test
So there's no way that should give a similar p-value to the other cases.
from your comment:
which indicates that it should handle pairwise comparisons as well.
That may be a correct conclusion, but whatever is needed to make it do that, you clearly didn't achieve it: the output shows you didn't get a signed rank test.
The two exact signed rank tests did produce the same p-value, as one would hope.
With extremely small p-values you should not expect approximate methods to be highly accurate (close to the exact test values). They do all lead one to reject at any sensible significance level, which is about as much as you can ask as far as consistency goes.
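You can quantify how much a resampled p-value should wobble: an estimate based on B random permutations is a binomial proportion, so its standard error is roughly sqrt(p(1-p)/B). A quick sketch (the 0.003 and B = 9999 are the values from the examples above):

```r
# Monte Carlo standard error of a resampled p-value estimate
mcSE <- function(p, B) sqrt(p * (1 - p) / B)
# for a p-value around 0.003 based on B = 9999 permutation draws:
mcSE(0.003, 9999)   # roughly 0.00055, so 0.003 vs. 0.0033 is within noise
```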
As for differences between them, you'd have to look to exactly what has been implemented for each test - what approximations are made, what assumptions, and so on.
The last p-value (the pairwise comparison) doesn't seem to be a one-tailed test so it's hardly surprising it's about twice as large as the two above it.
Best Answer
I doubt it, since it's not necessary -- we can compute the mean and variance without knowing the exact distribution, even in the presence of ties. But you could only tell with absolute certainty what the code does by actually looking at it.
No, the z value should be exact at any sample size (in the sense that it's a correct computation of the number of standard deviations the test statistic is from the mean under the null), but the resulting quantity doesn't have an exact normal distribution (under the null), so if you computed a p-value from that z, then the p-value would be approximate.
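To make that distinction concrete: for the rank sum statistic the null mean and variance are known in closed form, so the z value itself is an exact computation, while the p-value obtained from the normal distribution is an approximation. A sketch with toy data, without continuity or tie corrections:

```r
# toy data, for illustration only (no ties)
x <- c(5.2, 4.1, 6.3, 5.8, 4.9)
y <- c(3.9, 4.4, 3.1, 4.0, 3.5)
n1 <- length(x); n2 <- length(y)
W  <- sum(rank(c(x, y))[1:n1]) - n1 * (n1 + 1) / 2  # Mann-Whitney U
mu  <- n1 * n2 / 2                                  # exact null mean
sig <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)           # exact null SD
z <- (W - mu) / sig          # exact number of SDs from the null mean
pnorm(z, lower.tail=FALSE)   # approximate one-sided p-value
```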
Sure --
The test statistic isn't exactly normally distributed.
If they're using sampling of the permutation distribution rather than exact calculations, there's also some simulation error.