Confidence Interval – Calculating Interval from Significant p Value in Kruskal Wallis Test

confidence interval, kruskal-wallis test, nonparametric

I have run the nonparametric Kruskal-Wallis test with pairwise comparisons in SPSS to understand whether my 3 groups of survey participants differed in their responses to a range of 5-point ordinal scale questions (the data are not normally distributed).

Some responses are significant at p < 0.05 and some are not, and the results fit with what we expected to find. A particular journal I would like to submit to requires that I report 95% confidence intervals with any p values reported. Is this possible? It doesn't make sense to me, as my understanding of the KW test was that it is based on ranks. Thanks in advance.

To follow up and provide further information:
I'm now running the Wilcoxon tests in R. This is an example of an output where p < 0.0167 but the confidence interval includes 0, as the CI itself is so narrow:

wilcox.test(Data$Max[Data$Cluster == 1], Data$Max[Data$Cluster == 3], conf.int = TRUE, conf.level = 0.983)

    Wilcoxon rank sum test with continuity correction

data: Data$Max[Data$Cluster == 1] and Data$Max[Data$Cluster == 3]
W = 19368, p-value = 0.01348
alternative hypothesis: true location shift is not equal to 0
98.3 percent confidence interval:
-2.390968e-05 4.505737e-05
sample estimates:
difference in location
6.994035e-06

The data are very skewed! Here is the table of my groups (clusters 1-3) by answers to the question "Max", where 1 = Strongly oppose through to 5 = Strongly support:

table(Data$Cluster, Data$Max)

      1   2   3   4   5
  1   1   6  12  38 210
  2   0   6  11  34 103
  3   0   4  10  29  87

Best Answer

My interpretation of this is as follows. If I am wrong, please give the kind of clarification suggested by @whuber, perhaps along with some sample data to illustrate what you are doing.

If the Kruskal-Wallis test rejects the null hypothesis that the three medians are all equal, then you will use two-sample Wilcoxon tests to do multiple comparisons A vs B, B vs C and A vs C. In order to control the overall error rate for the three comparisons you might use the Bonferroni significance level $.05/3 = .0167$ for the comparisons.

Recently, some psychology and sociology journals have blamed irreproducibility of certain results on abuse of P-values, and ask for confidence intervals (CIs) in addition to or instead of P-values. (I'm not saying they are correct to deprecate P-values or that asking for CIs always makes sense, just stating what I have observed and heard.)

You might give $(100 - 1.67)\% = 98.3\%$ CIs for the differences in medians. Presumably, these could be CIs produced by Wilcoxon test procedures. A difficulty may be that a 5-point ordinal scale might produce some ties, but perhaps the approximate CIs given in spite of that would be useful.
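The Bonferroni arithmetic above can be checked in a few lines of R. This is only a sketch; the variable names (`alpha_overall`, `k`, `alpha_each`, `conf_each`) are illustrative and not part of the original analysis.

```r
# Bonferroni split of an overall 5% error rate across all pairwise
# comparisons among 3 groups (variable names are illustrative assumptions).
alpha_overall <- 0.05
k <- choose(3, 2)                    # number of pairwise comparisons: 3
alpha_each <- alpha_overall / k      # per-comparison significance level
conf_each  <- 1 - alpha_each         # matching per-comparison confidence level
print(round(c(alpha_each = alpha_each, conf_each = conf_each), 4))
```

The two values round to .0167 and .983, which are the significance level and confidence level used for each pairwise comparison.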

I doubt that the journal is asking for a CI for the overall Kruskal-Wallis test, but if so, perhaps use @jbowman's suggestion.

In the tentative exploration (in R) below I use fake simulated data for groups A, B, and C. There are $n = 50$ responses in each group, summarized as follows:

summary(A)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.190   2.212   2.935   2.817   3.280   4.700 
summary(B)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.350   2.695   3.450   3.331   4.107   4.690 
summary(C)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.620   3.328   3.800   3.672   4.228   4.690 

Concatenating the data to the vector X and making a group variable gp, we have the following notched boxplot. Notches in the sides of the boxes are approximate nonparametric CIs for individual group medians, calibrated so that two non-overlapping CIs indicate a significant difference. Roughly, it seems that A and B may differ significantly, that B and C clearly do not, and that A and C are obviously significantly different.

[Figure: notched boxplots of X for groups A, B, and C]
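A plot of this kind can be drawn with base R's `boxplot()` using `notch = TRUE`. The sketch below regenerates the simulated data from the note at the end of this answer so that it runs on its own. The axis labels, group names, and orientation are assumptions and may not match the original figure.

```r
# Simulated data, as in the note at the end of the answer
set.seed(918); n <- 50
A <- 1 + round(4 * rbeta(n, 2, 2), 2)
B <- 1 + round(4 * rbeta(n, 3, 2), 2)
C <- 1 + round(4 * rbeta(n, 3.5, 2), 2)
X <- c(A, B, C); gp <- rep(1:3, each = n)

# Notched boxplots; labels are assumptions, not taken from the original figure
bp <- boxplot(X ~ gp, notch = TRUE, names = c("A", "B", "C"),
              xlab = "Group", ylab = "Response")

# Lower and upper notch limits for each group's median (one column per group)
bp$conf
```

The matrix `bp$conf` holds the notch endpoints, so overlap between two groups can also be checked numerically rather than by eye.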

 kruskal.test(X ~ gp)

             Kruskal-Wallis rank sum test

     data:  X by gp
     Kruskal-Wallis chi-squared = 17.887, df = 2, p-value = 0.0001306

So there is no doubt that the groups vary. Now we do three 2-sample Wilcoxon tests. Remember that we are looking for P-values below .0167 in order to declare significant differences.

wilcox.test(A, B, conf.int = TRUE, conf.level = 0.983)

    Wilcoxon rank sum test with continuity correction

data:  A and B
W = 825, p-value = 0.003428
alternative hypothesis: true location shift is not equal to 0
98.3 percent confidence interval:
 -1.0200705 -0.1199649
sample estimates:
difference in location 
            -0.5900226 

wilcox.test(B, C, conf.int = TRUE, conf.level = 0.983)

        Wilcoxon rank sum test with continuity correction

data:  B and C
W = 984, p-value = 0.06719
alternative hypothesis: true location shift is not equal to 0
98.3 percent confidence interval:
 -0.72004953  0.09001254
sample estimates:
difference in location 
            -0.3000609 

wilcox.test(A, C, conf.int = TRUE, conf.level = 0.983)

        Wilcoxon rank sum test with continuity correction

data:  A and C
W = 537.5, p-value = 9.175e-07
alternative hypothesis: true location shift is not equal to 0
98.3 percent confidence interval:
 -1.3099358 -0.5300047
sample estimates:
difference in location 
            -0.9199606 

Summarizing, we see that A and B are significantly different according to the Bonferroni criterion [CI $(-1.02, -0.12)$]; B and C are not significantly different [CI includes 0]; A and C are highly significantly different [CI $(-1.31, -0.53)$].
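If a journal wants the CIs reported alongside the P-values, the three comparisons can be collected into one table. The following is a sketch under stated assumptions: it regenerates the simulated data so that it runs on its own, the names `groups`, `pairs`, and `res` are illustrative, and it uses the unrounded confidence level $1 - .05/3$, so the endpoints may differ slightly from those printed above for `.983`.

```r
# Simulated data, as in the note at the end of the answer
set.seed(918); n <- 50
A <- 1 + round(4 * rbeta(n, 2, 2), 2)
B <- 1 + round(4 * rbeta(n, 3, 2), 2)
C <- 1 + round(4 * rbeta(n, 3.5, 2), 2)

groups <- list(A = A, B = B, C = C)
pairs  <- combn(names(groups), 2, simplify = FALSE)   # A-B, A-C, B-C
alpha_each <- 0.05 / length(pairs)                    # Bonferroni level

# One row per comparison: estimated shift, CI, and P-value.
# exact = FALSE forces the normal approximation (the rounded data have ties).
res <- do.call(rbind, lapply(pairs, function(p) {
  wt <- wilcox.test(groups[[p[1]]], groups[[p[2]]],
                    conf.int = TRUE, conf.level = 1 - alpha_each,
                    exact = FALSE)
  data.frame(comparison = paste(p, collapse = " vs "),
             estimate   = unname(wt$estimate),
             lower      = wt$conf.int[1],
             upper      = wt$conf.int[2],
             p.value    = wt$p.value)
}))
res$significant <- res$p.value < alpha_each
print(res)
```

Note that the interval is for the location shift between the two groups (the Hodges-Lehmann estimate reported by `wilcox.test`), not for the difference of the two sample medians.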

Note: Data were simulated as follows:

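set.seed(918); n = 50
# Beta draws rescaled to the interval [1, 5] and rounded to two decimals
A = 1+round(4*rbeta(n, 2, 2),2)
B = 1+round(4*rbeta(n, 3, 2),2)
C = 1+round(4*rbeta(n, 3.5, 2),2)
# Pooled responses and group labels (1 = A, 2 = B, 3 = C)
X = c(A,B,C);  gp=rep(1:3, each=n)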
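For the decisions alone (without CIs), base R's `pairwise.wilcox.test()` runs the same three rank sum tests and can apply the Bonferroni adjustment to the P-values, so that adjusted P-values are compared with .05 instead of raw P-values with .0167. A sketch, repeating the simulation above so that it runs on its own; `exact = FALSE` is passed explicitly as an assumption, to force the normal approximation because the rounded data contain ties.

```r
# Simulated data, as in the note above
set.seed(918); n <- 50
A <- 1 + round(4 * rbeta(n, 2, 2), 2)
B <- 1 + round(4 * rbeta(n, 3, 2), 2)
C <- 1 + round(4 * rbeta(n, 3.5, 2), 2)
X <- c(A, B, C); gp <- rep(1:3, each = n)

# All pairwise Wilcoxon rank sum tests with Bonferroni-adjusted P-values
pw <- pairwise.wilcox.test(X, gp, p.adjust.method = "bonferroni",
                           exact = FALSE)
print(pw)

# Lower-triangular matrix of adjusted P-values (rows: groups 2, 3; cols: 1, 2)
pw$p.value < 0.05
```

Each adjusted P-value is the raw P-value multiplied by 3 (capped at 1), so comparing it with .05 gives the same decision as comparing the raw value with .0167.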