Solved – Confusion related to the Kruskal-Wallis test

Tags: kruskal-wallis test, MATLAB

I am confused about the Kruskal-Wallis test. Take the following example:

X=[2 2 35 10 9 8 11 12];
Y=[1 1 1 2 2 2 2 2];

Y is the grouping variable.

Then I ran the kruskalwallis test:

p = kruskalwallis(X,Y,'off')

I got a p-value of around 0.4. I had assumed that the Kruskal-Wallis test is based on the median, so it should be robust to the outlier of 35 that I put in the third position. Why isn't it robust to that? Is it because I have very few samples? Can anyone explain?

Best Answer

If Y is meant to be a grouping variable, the p-value in R is around 0.45:

> kruskal.test(x~y)

    Kruskal-Wallis rank sum test

data:  x by y 
Kruskal-Wallis chi-squared = 0.5622, df = 1, p-value = 0.4534

But it makes no difference whether that 35 is set to 13 or 35 or 1300 - the p-value is exactly the same. It is clearly robust to outliers.
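To see why the size of the outlier cannot matter, it helps to compute the statistic by hand. Below is a minimal Python sketch (standard library only; the helper names `midranks` and `kruskal_wallis_two_groups` are made up for this illustration and are not part of MATLAB or R). It assumes exactly two groups, so the chi-squared approximation has 1 degree of freedom, and it applies the usual tie correction. The test only ever sees the ranks, and any value above 12 in the third position gets rank 8.

```python
import math
from collections import Counter

def midranks(values):
    """Rank values 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_two_groups(x, g):
    """Return (H, p). Assumes two groups, so p uses chi-squared with df = 1."""
    n = len(x)
    rank_sums, sizes = Counter(), Counter()
    for r, label in zip(midranks(x), g):
        rank_sums[label] += r
        sizes[label] += 1
    assert len(sizes) == 2, "this sketch only handles two groups"
    h = 12 / (n * (n + 1)) * sum(rank_sums[k] ** 2 / sizes[k] for k in sizes)
    h -= 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    h /= 1 - sum(t ** 3 - t for t in Counter(x).values()) / (n ** 3 - n)
    p = math.erfc(math.sqrt(max(h, 0.0) / 2))  # chi-squared(1) upper tail
    return h, p

y = [1, 1, 1, 2, 2, 2, 2, 2]
for outlier in (13, 35, 1300):
    x = [2, 2, outlier, 10, 9, 8, 11, 12]
    h, p = kruskal_wallis_two_groups(x, y)
    print(outlier, midranks(x), round(h, 4), round(p, 4))
```

All three lines should show the same ranks, with a statistic of about 0.5622 and a p-value of about 0.4534, in line with the R output above.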

With continuity correction, the p-value is somewhat higher.
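For two groups the Kruskal-Wallis statistic is the square of the z-score from the normal approximation to the rank-sum test, so the effect of the continuity correction can be sketched directly. The following Python snippet is an illustration under that assumption, using the standard tie-corrected variance; the variable names are my own. The correction moves W half a unit toward its null mean before standardising, which shrinks |z| and raises the p-value.

```python
import math
from collections import Counter

x1 = [2, 2, 35]            # group 1
x2 = [10, 9, 8, 11, 12]    # group 2
n1, n2 = len(x1), len(x2)
n = n1 + n2

# Mann-Whitney W: number of pairs with x1 > x2, ties counting one half.
w = sum((a > b) + 0.5 * (a == b) for a in x1 for b in x2)

# Null mean and tie-corrected null variance of W.
mean_w = n1 * n2 / 2
ties = Counter(x1 + x2).values()
var_w = n1 * n2 / 12 * ((n + 1) - sum(t ** 3 - t for t in ties) / (n * (n - 1)))

def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))

z_plain = (w - mean_w) / math.sqrt(var_w)

# Continuity correction: pull W half a unit toward its mean.
shift = math.copysign(0.5, w - mean_w) if w != mean_w else 0.0
z_corrected = (w - mean_w - shift) / math.sqrt(var_w)

print("W =", w)
print("z^2 without correction:", round(z_plain ** 2, 4))
print("p without correction:", round(two_sided_p(z_plain), 4))
print("p with correction:", round(two_sided_p(z_corrected), 4))
```

The uncorrected p-value should come out near 0.4534 (the Kruskal-Wallis value) and the corrected one near 0.5486.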


Edit:

Here's an illustration of just how the Kruskal-Wallis p-value responds as you move the third observation around; that is, an empirical influence curve for the p-value as x[3] takes the various values of delta.

[Figure: Kruskal-Wallis p-value as x[3] changes]

We see that the Kruskal-Wallis is highly insensitive to all but a small range of values for x[3] (it is constant to the left of $[1,2]$ and constant to the right of it). It's really insensitive.

The grey line is the p-value with x[3] omitted. As you see, no value for x[3] will allow the Kruskal-Wallis to attain that p-value, though making x[3]=2 comes closest.


"I was assuming the Kruskal wallis test takes the median."

It's a rank-based ANOVA. It doesn't actually 'use' the median for anything.

The measure of location-shift that corresponds to the Wilcoxon-Mann-Whitney (and hence to the Kruskal-Wallis) is the median of pairwise differences between the samples.

> median(outer(x[y==1],x[y==2],"-"))
[1] -7

Compare:

> wilcox.test(x~y,conf.int=TRUE)

    Wilcoxon rank sum test with continuity correction

data:  x by y 
W = 5, p-value = 0.5486
alternative hypothesis: true location shift is not equal to 0 
95 percent confidence interval:
 -10   5 
sample estimates:
difference in location 
             -6.999992    #<-------------------------------

(I'm not sure why it doesn't have better accuracy there)

If you change the 35 to 13 or 1300, you get the same estimate of shift.
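The same invariance can be checked for the shift estimate. This is a small Python sketch of the median of pairwise differences (usually called the Hodges-Lehmann estimate); `shift_estimate` is a name chosen here for illustration. It also prints the difference of the two sample medians, to show that this is a different quantity.

```python
from statistics import median

def shift_estimate(a, b):
    """Median of all pairwise differences a_i - b_j."""
    return median(ai - bj for ai in a for bj in b)

group2 = [10, 9, 8, 11, 12]
for outlier in (13, 35, 1300):
    group1 = [2, 2, outlier]
    print(outlier,
          "pairwise-difference estimate:", shift_estimate(group1, group2),
          "difference of medians:", median(group1) - median(group2))
```

For each of the three values the pairwise-difference estimate should be -7, while the difference of the sample medians is -8.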

Adding a whole new observation is a different matter: if your original first group was just (2, 2), then adding a third observation changes the p-value. (This would be the case even if the median were the estimate of location shift.)
