[There seem to be a lot of similar questions here, so please point me in the right direction if this has already been answered, but I think this one is reasonably differentiated.]
There are many different implementations of post-hoc analyses following a Kruskal-Wallis test. I'm trying to understand how (why?) they differ, to get a sense of when one might be the right choice over another.
Working in R, consider this simulated dataset:
generate.sim.data <- function(seed) {
  set.seed(seed)
  sim1 <- rnorm(20, 4, 3)
  sim2 <- rnorm(20, 7, 3)
  sim3 <- rnorm(20, 1, 3)
  sim4 <- rnorm(20, 1, 3)
  simdata <- c(sim1, sim2, sim3, sim4)
  simgroup <- rep(c("sim1", "sim2", "sim3", "sim4"), each = 20)
  data.frame(simdata, simgroup)
}
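For reference, base R's kruskal.test provides the omnibus Kruskal-Wallis test that each of the post hoc procedures below follows up on (using the generate.sim.data function defined above):

```r
sim <- generate.sim.data(123)
# Omnibus Kruskal-Wallis test across the four simulated groups;
# with group means of 4, 7, 1, 1 it should reject decisively.
kruskal.test(simdata ~ simgroup, data = sim)
```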
The functions kruskal in the package agricolae; kruskalmc in the package pgirmess; posthoc.kruskal.nemenyi.test in the package PMCMR; and dunn.test in the package dunn.test all give different statistics (for any input). For certain values, they also give varying results in pairwise comparisons:
sim <- generate.sim.data(123)
kruskal(sim$simdata, sim$simgroup, console = TRUE)      # a, b, c, c
kruskalmc(sim$simdata, sim$simgroup)                    # a, a, b, b
posthoc.kruskal.nemenyi.test(sim$simdata, sim$simgroup) # a, a, b, b
dunn.test(sim$simdata, sim$simgroup)                    # a, b, c, c
but agree in some more clear-cut cases:
sim <- generate.sim.data(321)
kruskal(sim$simdata, sim$simgroup, console = TRUE)      # a, a, b, b
kruskalmc(sim$simdata, sim$simgroup)                    # a, a, b, b
posthoc.kruskal.nemenyi.test(sim$simdata, sim$simgroup) # a, a, b, b
dunn.test(sim$simdata, sim$simgroup)                    # a, a, b, b
It seems that kruskalmc and posthoc.kruskal.nemenyi.test give similar results no matter what, and kruskal and dunn.test tend to give similar results, but the latter is not always the case, e.g.:
sim <- generate.sim.data(4444)
kruskal(sim$simdata, sim$simgroup, console = TRUE)      # a, b, c, c
kruskalmc(sim$simdata, sim$simgroup)                    # ac, a, b, bc
posthoc.kruskal.nemenyi.test(sim$simdata, sim$simgroup) # ac, a, b, bc
dunn.test(sim$simdata, sim$simgroup)                    # ac, b, c, c
I realize I'm quibbling about some different behavior of the tests based on p-values very close to 0.05, but these tests also give different diagnoses for real data sets (e.g., observation ~ method from the corn data in agricolae; occupation ~ eligibility from the homecare data in dunn.test). I wondered what the underlying differences are between the tests, and whether there's a reasonable criterion to choose one over another.
Best Answer
Understanding how these test implementations differ requires understanding the actual test statistics themselves.

For example, dunn.test provides Dunn's (1964) z-test approximation to a rank-sum test, employing both the same ranks used in the Kruskal-Wallis test and the pooled variance estimate implied by the null hypothesis of the Kruskal-Wallis test (akin to using the pooled variance to calculate t-test statistics following an ANOVA).

By contrast, the Kruskal-Nemenyi test as implemented in posthoc.kruskal.nemenyi.test is based on either the Studentized range distribution or the $\chi^{2}$ distribution, depending on user choice.

The kruskalmc function in the pgirmess package implements Dunn's post hoc rank-sum comparison using z-test statistics as directed by Siegel and Castellan (1988), but these authors do not include Dunn's (1964) correction for ties, so kruskalmc will be less accurate than dunn.test when ties exist in the data.

It is difficult to discern from the documentation of kruskal whether the author is using the Conover-Iman t approximation to the distribution of rank-sum differences (similar to Dunn's test, but it requires that the Kruskal-Wallis test be rejected, and it is more powerful). A brief glance at the code does not immediately scream Conover-Iman to me; however, it is quite possible that it is an implementation of that test. The Conover-Iman test is more certainly implemented in R in the conover.test package.

The tl;dr: these all appear to be implementations of different test statistics, or of different forms of the same test statistic, so there is no reason to expect them to agree.
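To make the contrast concrete, here is a base-R sketch of the Dunn z statistic and the Conover-Iman t statistic for a single pairwise comparison, written from the published formulas (Dunn 1964; Conover and Iman 1979) rather than from any package's source; the function names dunn_z and conover_t are my own, and the packages above may differ in details such as tie handling and p-value adjustment:

```r
# Hedged sketch, not package internals: both statistics use the mean-rank
# difference from the pooled ranks of the full sample, but scale it differently.

# Dunn's z: standard error implied by the Kruskal-Wallis null,
# with Dunn's (1964) correction for ties.
dunn_z <- function(x, g, i, j) {
  r <- rank(x)                                   # midranks for ties
  N <- length(x)
  tie <- table(x)
  tie_adj <- sum(tie^3 - tie) / (12 * (N - 1))   # Dunn's tie correction
  se <- sqrt((N * (N + 1) / 12 - tie_adj) *
             (1 / sum(g == i) + 1 / sum(g == j)))
  (mean(r[g == i]) - mean(r[g == j])) / se
}

# Conover-Iman t: same numerator, but scaled by the sample variance of the
# ranks and shrunk by the observed Kruskal-Wallis statistic; referred to a
# t distribution with N - k degrees of freedom.
conover_t <- function(x, g, i, j) {
  r <- rank(x); N <- length(x); k <- length(unique(g))
  S2 <- (sum(r^2) - N * (N + 1)^2 / 4) / (N - 1)  # variance of the ranks
  H  <- (sum(tapply(r, g, sum)^2 / tabulate(factor(g))) -
         N * (N + 1)^2 / 4) / S2                  # tie-corrected KW statistic
  se <- sqrt(S2 * (N - 1 - H) / (N - k) *
             (1 / sum(g == i) + 1 / sum(g == j)))
  (mean(r[g == i]) - mean(r[g == j])) / se
}

x <- c(2.1, 3.5, 1.2, 4.8, 0.3, 2.9)             # toy data, no ties
g <- rep(c("a", "b"), each = 3)
dunn_z(x, g, "a", "b")
conover_t(x, g, "a", "b")
```

Even on tie-free toy data the two statistics differ, which is the point: each implementation is a defensible but distinct recipe, not a bug in any one package.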
References
Conover, W. J. (1999). Practical Nonparametric Statistics. Wiley.
Conover, W. J. and Iman, R. L. (1979). On multiple-comparisons procedures. Technical Report LA-7677-MS, Los Alamos Scientific Laboratory.
Dunn, O. J. (1964). Multiple comparisons using rank sums. Technometrics, 6(3):241–252.
Siegel, S. and Castellan, N. J. (1988). Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York. pp. 213-214.