statistical-significance – Strange Result in Post-Hoc Test: Causes and Solutions

Tags: kruskal-wallis test, post-hoc, statistical significance

I have data for a test on three groups. The measured variable is ratio-scaled. The data in R:

g1a <- c(7, 3, 40)
g2a <- c(1, 1, 2)
g3a <- c(0, 0, 0)

Since the sample is small and normality cannot be guaranteed, I run a Kruskal-Wallis test to check for significance:

l<-list(g1a,g2a,g3a)
kruskal.test(l)

The p-value is 0.02336, which is nice.

Now I run a post-hoc test, using the Mann-Whitney U:

wilcox.test(g1a,g2a,paired=FALSE,exact=TRUE)
wilcox.test(g2a,g3a,paired=FALSE,exact=TRUE)
wilcox.test(g1a,g3a,paired=FALSE,exact=TRUE)

All the resulting p-values are above 0.05 (0.07652, 0.0636, 0.05935). This is very strange. Shouldn't one of these tests give a much lower p-value? Especially since I'd have to use some sort of correction to account for the multiple comparisons in the post-hoc test. In other words: how can I interpret this result?

Best Answer

Think of it this way: overall, there's a significant difference, but it's hard to say exactly which pairs of groups differ significantly. Alternatively, consider the chance of getting three p-values all below 0.1 if there were no real differences (even though the tests aren't independent of each other): pretty small, right? So, again overall, we might suspect something significant is in the data, without being able to tell exactly where.

Your small sample sizes don't help; they mean the power of your tests is very low, and they also severely constrain which p-values you can get, as the following example shows:

> g1a <- rnorm(3,0,1)
> g2a <- rnorm(3,2.5,1)
> g3a <- rnorm(3,5,1)
> 
> y <- list(g1a,g2a,g3a)
> y
[[1]]
[1] -2.31356435 -0.09903136 -0.42037052

[[2]]
[1] 2.806082 2.799857 3.383844

[[3]]
[1] 6.543636 6.845559 4.838341

> kruskal.test(y)

    Kruskal-Wallis rank sum test

data:  y 
Kruskal-Wallis chi-squared = 7.2, df = 2, p-value = 0.02732

So far, so good. On to the three Wilcoxon tests:

> wilcox.test(g1a,g2a,paired=FALSE,exact=TRUE)

    Wilcoxon rank sum test

data:  g1a and g2a 
W = 0, p-value = 0.1
alternative hypothesis: true location shift is not equal to 0 

> wilcox.test(g2a,g3a,paired=FALSE,exact=TRUE)

    Wilcoxon rank sum test

data:  g2a and g3a 
W = 0, p-value = 0.1
alternative hypothesis: true location shift is not equal to 0 

> wilcox.test(g1a,g3a,paired=FALSE,exact=TRUE)

    Wilcoxon rank sum test

data:  g1a and g3a 
W = 0, p-value = 0.1
alternative hypothesis: true location shift is not equal to 0 

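All three p-values are 0.1, yet the data can't be any more extreme than W = 0 (the two groups are completely separated), so evidently we've hit a sample-size-imposed limit on the p-values: with three observations per group, the exact two-sided test cannot return anything below 0.1.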
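To see where that floor of 0.1 comes from, here is a minimal sketch. It assumes tie-free data and the exact two-sided test; min_exact_p is a helper name made up for this illustration, not part of R. Under the null hypothesis every way of assigning ranks to the first group is equally likely, and only one assignment gives W = 0, so the smallest two-sided p-value is 2 / choose(n1 + n2, n1).

```r
# Hypothetical helper (not part of base R): smallest two-sided exact
# p-value of the Wilcoxon rank sum test for group sizes n1 and n2,
# assuming there are no ties in the data.
min_exact_p <- function(n1, n2) {
  # One rank assignment yields W = 0 and one yields W = n1 * n2,
  # out of choose(n1 + n2, n1) equally likely assignments under H0.
  2 / choose(n1 + n2, n1)
}

min_exact_p(3, 3)  # 2/20  = 0.1
min_exact_p(4, 4)  # 2/70,  about 0.0286
min_exact_p(5, 5)  # 2/252, about 0.0079

# Cross-check with wilcox.test on completely separated, tie-free data
p_sep <- wilcox.test(c(1, 2, 3), c(4, 5, 6), exact = TRUE)$p.value
p_sep

# A Bonferroni threshold for three comparisons at the 0.05 level
bonferroni_alpha <- 0.05 / 3
min_exact_p(3, 3) > bonferroni_alpha  # the floor lies above the threshold
```

So with three observations per group no pairwise comparison can reach 0.05, let alone a Bonferroni-adjusted threshold. Note also that the data in the question contain ties; in that case wilcox.test cannot compute an exact p-value even with exact=TRUE and falls back to a normal approximation (with a warning), which is why the p-values there (0.07652, 0.0636, 0.05935) are not multiples of 1/20.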