Kruskal-Wallis Test – Difference Between Various Post-Hoc Tests in R

Tags: kruskal-wallis-test, post-hoc, r

[There seem to be a lot of similar questions here, so please point me in the right direction if this has already been answered, but I think this one is reasonably differentiated.]

There are many different implementations of post-hoc analyses following a Kruskal-Wallis test. I'm trying to understand how (why?) they differ, to get a sense of when one might be the right choice over another.

Working in R, consider this simulated dataset:

generate.sim.data <- function(seed) {
  set.seed(seed)
  # Four groups of 20 observations, common SD of 3: sim1 and sim2 have
  # distinct means (4 and 7), while sim3 and sim4 share the same mean (1)
  sim1 <- rnorm(20, 4, 3)
  sim2 <- rnorm(20, 7, 3)
  sim3 <- rnorm(20, 1, 3)
  sim4 <- rnorm(20, 1, 3)
  simdata <- c(sim1, sim2, sim3, sim4)
  simgroup <- rep(c("sim1", "sim2", "sim3", "sim4"), each = 20)
  data.frame(simdata, simgroup)
}

The functions kruskal in the package agricolae, kruskalmc in the package pgirmess, posthoc.kruskal.nemenyi.test in the package PMCMR, and dunn.test in the package dunn.test all give different statistics (for any input). For certain seeds, they also give conflicting results in the pairwise comparisons:

  # load the four packages being compared
  library(agricolae); library(pgirmess); library(PMCMR); library(dunn.test)

  sim<-generate.sim.data(123)
  kruskal(sim$simdata,sim$simgroup,console=T)               #a,b,c,c
  kruskalmc(sim$simdata,sim$simgroup)                       #a,a,b,b
  posthoc.kruskal.nemenyi.test(sim$simdata,sim$simgroup)    #a,a,b,b
  dunn.test(sim$simdata,sim$simgroup)                       #a,b,c,c

but agree in some more clear-cut cases:

  sim<-generate.sim.data(321)
  kruskal(sim$simdata,sim$simgroup,console=T)               #a,a,b,b
  kruskalmc(sim$simdata,sim$simgroup)                       #a,a,b,b
  posthoc.kruskal.nemenyi.test(sim$simdata,sim$simgroup)    #a,a,b,b
  dunn.test(sim$simdata,sim$simgroup)                       #a,a,b,b

It seems that kruskalmc and posthoc.kruskal.nemenyi.test give similar results no matter what, and that kruskal and dunn.test tend to give similar results, but the latter pair do not always agree, e.g.:

  sim<-generate.sim.data(4444)
  kruskal(sim$simdata,sim$simgroup,console=T)               #a,b,c,c
  kruskalmc(sim$simdata,sim$simgroup)                       #ac,a,b,bc
  posthoc.kruskal.nemenyi.test(sim$simdata,sim$simgroup)    #ac,a,b,bc
  dunn.test(sim$simdata,sim$simgroup)                       #ac,b,c,c

I realize I'm quibbling over differences in behavior driven by p-values very close to 0.05, but these tests also give different diagnoses for real data sets (e.g., observation~method from the corn data in agricolae; occupation~eligibility from the homecare data in dunn.test). I wonder what the underlying differences between the tests are, and whether there's a reasonable criterion for choosing one over another.

Best Answer

Understanding how these test implementations differ requires understanding the actual test statistics themselves.

For example, dunn.test implements Dunn's (1964) z-test approximation to a rank-sum test, employing both the same ranks used in the Kruskal-Wallis test and the pooled variance estimate implied by the Kruskal-Wallis null hypothesis (akin to using the pooled variance to calculate t-test statistics following an ANOVA).
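Concretely, Dunn's z statistic for the comparison of groups $A$ and $B$ can be written (in my notation, following Dunn 1964) as

$$z_{AB} = \frac{\bar{R}_A - \bar{R}_B}{\sqrt{\left(\frac{N(N+1)}{12} - \frac{\sum_{s=1}^{r}\left(\tau_s^{3} - \tau_s\right)}{12(N-1)}\right)\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$$

where $\bar{R}_A$ and $\bar{R}_B$ are the mean ranks of the two groups from the joint ranking used by the Kruskal-Wallis test, $N$ is the total sample size, $n_A$ and $n_B$ are the group sizes, $r$ is the number of distinct tied values, and $\tau_s$ is the number of observations tied at the $s$-th of these. Note the tie correction subtracted inside the pooled variance; it will matter again below.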

By contrast, the Kruskal-Nemenyi test as implemented in posthoc.kruskal.nemenyi.test is based on either the Studentized range distribution or the $\chi^{2}$ distribution, depending on the user's choice.
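If I read the PMCMR documentation correctly, the approximation is selected via the dist argument; a minimal sketch (argument values assumed from my reading of the docs):

  library(PMCMR)
  sim <- generate.sim.data(123)
  # Studentized range approximation (the package default, I believe)
  posthoc.kruskal.nemenyi.test(sim$simdata, sim$simgroup, dist = "Tukey")
  # chi-squared approximation
  posthoc.kruskal.nemenyi.test(sim$simdata, sim$simgroup, dist = "Chisquare")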

The kruskalmc function in the pgirmess package implements Dunn's post hoc rank-sum comparison using z-test statistics as directed by Siegel and Castellan (1988), but those authors do not include Dunn's (1964) correction for ties, so kruskalmc will be less accurate than dunn.test when ties exist in the data.
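A quick way to see the consequence is to force ties into the simulated data. A minimal sketch, assuming the same function interfaces as above (round is just an arbitrary way to coarsen the data):

  sim <- generate.sim.data(123)
  tied <- round(sim$simdata)    # rounding creates tied ranks
  kruskalmc(tied, sim$simgroup) # Siegel & Castellan version: no tie correction
  dunn.test(tied, sim$simgroup) # Dunn (1964): variance estimate adjusted for ties

With many ties, the two sets of z statistics should diverge even where the untied data gave identical groupings.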

It is difficult to discern from the documentation of kruskal whether its author is using the Conover-Iman t approximation to the distribution of rank-sum differences (similar to Dunn's test, except that it requires the Kruskal-Wallis null to be rejected first, and is more powerful). A brief glance at the code does not immediately scream Conover-Iman to me; however, it is quite possible that it is an implementation of that test. More certainly, the Conover-Iman test is implemented in R in the conover.test package.
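If you want the Conover-Iman test specifically, conover.test mirrors the dunn.test interface (both packages are, as far as I know, by the same author), so it can be run on the same simulated data:

  library(conover.test)
  sim <- generate.sim.data(4444)
  # Conover-Iman t statistics computed on the Kruskal-Wallis ranks;
  # only valid after the Kruskal-Wallis test itself rejects
  conover.test(sim$simdata, sim$simgroup)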

The tl;dr: these all appear to be implementations of different test statistics or different forms of the same test statistic, so there is no reason to expect them to agree.

References

Conover, W. J. (1999). Practical Nonparametric Statistics. 3rd ed. Wiley, New York.

Conover, W. J. and Iman, R. L. (1979). On multiple-comparisons procedures. Technical Report LA-7677-MS, Los Alamos Scientific Laboratory.

Dunn, O. J. (1964). Multiple comparisons using rank sums. Technometrics, 6(3):241–252.

Siegel, S. and Castellan, N. J. (1988). Nonparametric Statistics for the Behavioral Sciences. 2nd ed. McGraw-Hill, New York, pp. 213–214.