Solved – Adjusting for multiple Kruskal-Wallis tests

Tags: kruskal-wallis-test, multiple-comparisons, r

I've run 3 different machine learning algorithms on 10 different datasets, generating an accuracy score for each algorithm on each dataset. My hypothesis is that two of the algorithms are consistently better than the third. I've noticed that the accuracies aren't normally distributed, so I'm looking to use non-parametric tests.

My initial idea on how to assess any difference is to run Kruskal-Wallis on each dataset, to see if there is a significant difference in the accuracies from each algorithm. As I'd be running K-W 10 times, would I need to account for this with a multiple comparison correction method?

If on any of the datasets I get a significant result, I'd run a post-hoc analysis. From what I've seen there aren't many simple non-parametric post-hoc techniques in R, so I would run 3 pairwise Mann-Whitney U-tests, one for each pair of algorithms.

My questions are:

  1. Is running 10 K-W tests the correct approach for the first part? If so, would I need to correct for multiple tests?
  2. Is my post-hoc analysis a good approach?

Best Answer

Let's step back and look at what the data would look like. From what you describe, you have 3 algorithms (i.e. groups or treatments) and 10 datasets (i.e. subjects). In other words, you have a within-subjects design (i.e. repeated measures) with one factor. One way to represent this in R is:

# Simulated accuracies in long format: one row per dataset/algorithm combination
set.seed(123)
df <- data.frame(dataset = rep(seq(10), 3), 
                 algorithm = rep(c("ML1","ML2","ML3"), each=10), 
                 Accuracy = runif(30))
> df
   dataset algorithm   Accuracy
1        1       ML1 0.28757752
2        2       ML1 0.78830514
3        3       ML1 0.40897692
4        4       ML1 0.88301740
5        5       ML1 0.94046728
6        6       ML1 0.04555650
7        7       ML1 0.52810549
8        8       ML1 0.89241904
9        9       ML1 0.55143501
10      10       ML1 0.45661474
11       1       ML2 0.95683335
12       2       ML2 0.45333416
13       3       ML2 0.67757064
14       4       ML2 0.57263340
15       5       ML2 0.10292468
16       6       ML2 0.89982497
17       7       ML2 0.24608773
18       8       ML2 0.04205953
19       9       ML2 0.32792072
20      10       ML2 0.95450365
21       1       ML3 0.88953932
22       2       ML3 0.69280341
23       3       ML3 0.64050681
24       4       ML3 0.99426978
25       5       ML3 0.65570580
26       6       ML3 0.70853047
27       7       ML3 0.54406602
28       8       ML3 0.59414202
29       9       ML3 0.28915974
30      10       ML3 0.14711365

You will typically see examples like this with 'subject' as the label; in your case, the 'subjects' are the datasets. If you could assume normality, you would do a repeated-measures ANOVA. However, you state that the accuracies are not normally distributed, so you naturally want a non-parametric method. Your design is also complete and balanced (every algorithm is run on all 10 datasets), so you can use the Friedman test, which is essentially a nonparametric repeated-measures ANOVA.
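
For contrast, if normality did hold, one common way to fit that repeated-measures ANOVA on the data frame above is an aov() call with dataset as the error (block) term. This is just a sketch of the parametric route you are not taking:

# Parametric counterpart (only if normality held): repeated-measures /
# randomized-block ANOVA, treating each dataset as a subject/block
summary(aov(Accuracy ~ algorithm + Error(factor(dataset)), data = df))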

If you get a significant p-value from the Friedman test, you would do a post-hoc analysis with pairwise paired Wilcoxon tests and some form of correction (e.g. Bonferroni, Holm). You would not use Mann-Whitney here, because your data are paired/repeated measures: the same datasets appear under each algorithm.

Lastly, you probably want an effect size for any significant differences. This is also based on the Wilcoxon test. I can't recall a base-R function for it, but the equation is very simple:

$$r=\frac{Z}{\sqrt{N}}$$

where Z is the Z-score and N is the total number of observations across the two groups being compared. You can get this Z-score from wilcoxsign_test() in the coin package.

Using the above data, this can be done in R as follows. Note that the data were randomly generated, so nothing will be significant; this is just to demonstrate the code:

# Friedman Test
friedman.test(Accuracy ~ algorithm|dataset, data=df)

# Post-hoc: pairwise paired Wilcoxon tests with Bonferroni correction
with(df, pairwise.wilcox.test(Accuracy, algorithm, p.adjust.method = "bonferroni", paired = TRUE))

# Get Z-score for calculating the effect size (here: ML1 vs ML2)
library(coin)
wilcoxsign_test(Accuracy ~ factor(algorithm) | factor(dataset),
                data = subset(df, algorithm %in% c("ML1", "ML2")))

# Calculate effect size r = |Z| / sqrt(N); here Z = -0.2548 and
# N = 20 observations (10 datasets x 2 algorithms)
0.2548 / sqrt(20)
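
If you want to avoid doing that arithmetic by hand for every pair, a small helper along these lines works (my own sketch, not from any package; it assumes the long-format df above with its Accuracy, algorithm and dataset columns). It pulls the Z statistic out of the coin test object and returns r directly:

# Sketch: effect size r = |Z| / sqrt(N) for one pair of algorithms,
# using the standardized Z statistic from coin::wilcoxsign_test
library(coin)

pairwise_r <- function(data, alg1, alg2) {
  sub <- subset(data, algorithm %in% c(alg1, alg2))
  wt  <- wilcoxsign_test(Accuracy ~ factor(algorithm) | factor(dataset), data = sub)
  z   <- as.numeric(statistic(wt))  # the Z reported in the print-out
  abs(z) / sqrt(nrow(sub))          # N = observations in both groups combined
}

pairwise_r(df, "ML1", "ML2")
pairwise_r(df, "ML1", "ML3")
pairwise_r(df, "ML2", "ML3")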