Solved – Negative H value in Kruskal Wallis test


I've found exactly one source addressing this (and of course didn't save it). It said that in a Kruskal-Wallis test this is a consequence of having a large sample with a lot of ties. Seeing as I've got about 50,000 respondents and only an 11-point scale variable, I'd say I qualify on both counts.
What that source didn't say, however, is how to treat this anomaly. At first I just treated it as if $p > 0.05$, i.e. not significant. However, when I accidentally loaded the wrong data into my post hoc analysis (the data with the negative H), a lot of the pairwise comparisons turned out to be significant (even more so than in some of the tests where the H value had $p < 0.001$).
So that made me wonder whether I have to treat this negative H value differently. Should I just use a random subsample of the data to see whether that gives a significant H value, or declare the test invalid and just see what happens with the post hoc tests (the latter seems unlikely)?
By the way, my post hoc analysis consists of Bonferroni-corrected Mann-Whitney U comparisons.

Best Answer

The Kruskal-Wallis $H$ statistic is given by:

$$H=\frac{\frac{12\sum_{i=1}^{k}{n_{i}\left(\bar{R}_{i}-\bar{R}\right)^{2}}}{N\left(N+1\right)}}{1-\frac{\sum{T}}{N^{3}-N}}\text{, where:}$$

$k$ is the number of groups;
$N$ is the number of observations across all groups;
$n_{i}$ is the number of observations in the $i^{th}$ group;
$\bar{R}$ is the mean rank of all observations;
$\bar{R}_{i}$ is the mean rank of observations from the $i^{th}$ group (ranks are across observations from all groups); and
$T=t^{3}-t$ for each set of tied ranks, where $t$ is the number of tied observations in the set, and $\sum{T}$ is the sum of this quantity across all sets of tied ranks.

When there are no ties, $\sum{T}=0$ and the denominator of $H$ simplifies to $1$.
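As a rough check of the formula above, here is a minimal R sketch that computes $H$ term by term and compares it with base R's `kruskal.test`. The helper name `kw_h` and the simulated 11-point data are illustrative assumptions, not part of the original question.

```r
# Kruskal-Wallis H with the tie correction, written directly from the formula.
# kw_h is a hypothetical helper name; it is not part of base R.
kw_h <- function(x, g) {
  g <- factor(g)
  N <- length(x)
  r <- rank(x)                           # midranks are assigned to ties
  n_i <- tapply(r, g, length)            # group sizes
  rbar_i <- tapply(r, g, mean)           # mean rank within each group
  rbar <- (N + 1) / 2                    # mean rank of all observations
  numerator <- 12 * sum(n_i * (rbar_i - rbar)^2) / (N * (N + 1))
  t <- table(x)                          # size of each set of tied values
  denominator <- 1 - sum(t^3 - t) / (N^3 - N)
  as.numeric(numerator / denominator)
}

# Simulated data (an assumption for illustration): 11-point scale, many ties
set.seed(1)
x <- sample(0:10, 300, replace = TRUE)
g <- rep(c("a", "b", "c"), each = 100)

h <- kw_h(x, g)
h_ref <- unname(kruskal.test(x, factor(g))$statistic)
print(c(manual = h, kruskal.test = h_ref))
```

Both the numerator (a weighted sum of squares) and the denominator are computed in double precision here, so the two values should agree up to rounding.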

For $N=50{,}000$ and a uniform distribution of ties across your eleven possible values (about $4545$ observations per value), the denominator of $H$ is approximately:

$$1-\frac{11\left(4545^3-4545\right)}{50{,}000^3-50{,}000} \approx 0.9917$$

Assuming a highly skewed distribution of ties (say, all but ten observations tied on a single value), the denominator of $H$ is approximately:

$$1-\frac{\left(49{,}990^3-49{,}990\right)}{50{,}000^3-50{,}000} \approx 0.0006$$

The most extreme case would be where all $N$ observations were tied on the same value, in which case the denominator of $H$ would simplify to $0$, and $H$ would thus be undefined.
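The three scenarios above can be recomputed with a few lines of R. This is only a sketch: `tie_denominator` is a hypothetical helper, and the split of 50,000 observations into eleven near-equal tied sets (five of 4546 and six of 4545) is an assumption made so that the sizes sum exactly to $N$.

```r
# Denominator of H, 1 - sum(T) / (N^3 - N), from the sizes of the tied sets.
# tie_denominator is a hypothetical helper name.
tie_denominator <- function(tie_sizes) {
  N <- sum(tie_sizes)
  1 - sum(tie_sizes^3 - tie_sizes) / (N^3 - N)
}

# Near-uniform ties over eleven values (sizes sum to 50000)
uniform <- tie_denominator(c(rep(4546, 5), rep(4545, 6)))

# All but ten observations tied on one value; the other ten are untied
skewed <- tie_denominator(c(49990, rep(1, 10)))

# Every observation tied on the same value
all_tied <- tie_denominator(50000)

print(c(uniform = uniform, skewed = skewed, all_tied = all_tied))
```

Untied observations are passed as sets of size 1, which contribute $T = 0$.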

Because the tied sets together contain at most $N$ observations, $\sum{T}$ can never exceed $N^{3}-N$. I therefore do not think it is possible to obtain a negative value of the denominator, and, since the numerator is a sum of squares, not possible to obtain a negative value of $H$.
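One way to spell out this bound is sketched here. It writes $t_{j}$ for the size of the $j^{th}$ set of tied ranks (a symbol introduced only for this sketch) and counts each untied observation as a set of size $1$. For any $a, b \ge 0$:

$$(a+b)^{3}-(a+b) = \left(a^{3}-a\right)+\left(b^{3}-b\right)+3ab(a+b) \ge \left(a^{3}-a\right)+\left(b^{3}-b\right)$$

Applying this repeatedly to merge the tied sets, and using $\sum_{j}{t_{j}}=N$:

$$\sum{T}=\sum_{j}{\left(t_{j}^{3}-t_{j}\right)} \le \left(\sum_{j}{t_{j}}\right)^{3}-\sum_{j}{t_{j}} = N^{3}-N \quad\Rightarrow\quad 1-\frac{\sum{T}}{N^{3}-N} \ge 0$$

Equality holds only when all $N$ observations fall in a single tied set, which is the undefined case described above.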

Conclusion:

It is not possible to obtain a negative value of $H$ by adjusting for ties using Kruskal and Wallis's formula for $H$ (Equation 1.2) and their adjustment for ties (Equation 1.3).
Cubing a large $N$ might push one's software beyond its available precision, and numerical inconsistencies (such as a negative $H$) might thus result.
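As an illustration of how cubing a large $N$ can go wrong, the R sketch below contrasts 32-bit integer arithmetic with double precision. It shows only one possible failure mode; whether the software used in the question behaves this way is an assumption that cannot be checked from the post.

```r
N <- 50000L                                # stored as a 32-bit integer

print(.Machine$integer.max)                # largest representable integer in R

# N * N = 2.5e9 exceeds the integer range, so R returns NA with a warning.
# Software that wraps around silently could produce a wrong (even negative) value.
int_product <- suppressWarnings(N * N)
print(int_product)

# In double precision the cube is exactly representable (it is below 2^53)
dbl_cube <- as.numeric(N)^3
print(dbl_cube < 2^53)
print(dbl_cube - N)                        # N^3 - N computed without overflow
```

Converting to double before cubing, as in the last lines, avoids the problem for sample sizes of this order.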

Kruskal, W. H. and Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621.

Related Solutions

Solved – Adjusting for multiple Kruskal-Wallis tests

Let's step back and look at what the data would look like. From what you describe, you have 3 algorithms (i.e. groups or treatments) and 10 datasets (i.e. subjects). In this case, you have a within-subjects design (i.e. repeated measures) with one factor. One way to represent this is:

set.seed(123)
df <- data.frame(dataset = rep(seq(10), 3), 
                 algorithm = rep(c("ML1","ML2","ML3"), each=10), 
                 Accuracy = runif(30))
> df
   dataset algorithm   Accuracy
1        1       ML1 0.28757752
2        2       ML1 0.78830514
3        3       ML1 0.40897692
4        4       ML1 0.88301740
5        5       ML1 0.94046728
6        6       ML1 0.04555650
7        7       ML1 0.52810549
8        8       ML1 0.89241904
9        9       ML1 0.55143501
10      10       ML1 0.45661474
11       1       ML2 0.95683335
12       2       ML2 0.45333416
13       3       ML2 0.67757064
14       4       ML2 0.57263340
15       5       ML2 0.10292468
16       6       ML2 0.89982497
17       7       ML2 0.24608773
18       8       ML2 0.04205953
19       9       ML2 0.32792072
20      10       ML2 0.95450365
21       1       ML3 0.88953932
22       2       ML3 0.69280341
23       3       ML3 0.64050681
24       4       ML3 0.99426978
25       5       ML3 0.65570580
26       6       ML3 0.70853047
27       7       ML3 0.54406602
28       8       ML3 0.59414202
29       9       ML3 0.28915974
30      10       ML3 0.14711365

You will typically see examples that have 'subject' as a label. In your case, your 'subjects' are 'datasets'. If you can assume normality, you would do repeated-measures ANOVA. However, you state you know the accuracies are not normally distributed and you naturally want a non-parametric method. Your dataset is also balanced (10 samples/group) so we can use the Friedman test (which essentially is a nonparametric repeated-measures ANOVA).

If you get a significant p-value from the test, you would do post-hoc analysis with a pairwise paired Wilcoxon test with some sort of correction (e.g. bonferroni, holm, etc.). You would not use Mann-Whitney because you have 'paired/repeated measures' data.
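To show what such a correction does to a set of pairwise p-values, here is a small sketch using base R's `p.adjust`. The three raw p-values are made up for illustration and do not come from any data in this post.

```r
# Hypothetical raw p-values from three pairwise comparisons
p_raw <- c(0.01, 0.02, 0.04)

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Holm: step-down version, never more conservative than Bonferroni
p_holm <- p.adjust(p_raw, method = "holm")

print(rbind(raw = p_raw, bonferroni = p_bonf, holm = p_holm))
```

`pairwise.wilcox.test` applies the same adjustments internally through its `p.adjust.method` argument.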

Lastly, you probably want the effect size for any significant differences. This also uses the Wilcoxon test. In R there is no function I can recall right now, but the equation is very simple:

$$r=\frac{Z}{\sqrt{N}}$$

Where Z is the Z-score and N is the sample size (between the two groups being compared). You can get this Z-score using the wilcoxsign_test from the coin package.

Using the above data, this can be done in R with the following. Please note, the above data was just randomly generated so there is no significance. This is just for demonstrating some code:

# Friedman Test
friedman.test(Accuracy ~ algorithm|dataset, data=df)

# Post-hoc tests with Bonferroni correction
with(df, pairwise.wilcox.test(Accuracy, algorithm, p.adjust.method = "bonferroni", paired = TRUE))

# Get Z-score for calculating effect-size
library(coin)
with(df, wilcoxsign_test(Accuracy ~ factor(algorithm)|factor(dataset), 
                         data=df[algorithm == "ML1" | algorithm == "ML2",]))

# Calculate effect size; in this case Z = -0.2548 and N = 20 (two groups of 10 datasets)
0.2548/sqrt(20)
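
The same calculation can be wrapped in a small helper so it is easy to reuse for each pairwise comparison. This is a sketch only: `effect_size_r` is a hypothetical name, and it assumes the Z-score has already been read off the `wilcoxsign_test` output.

```r
# r = |Z| / sqrt(N), where N is the total number of observations compared.
# effect_size_r is a hypothetical helper name, not a function from coin.
effect_size_r <- function(z, n) {
  abs(z) / sqrt(n)
}

# Z and N taken from the ML1 vs ML2 comparison above
r <- effect_size_r(z = -0.2548, n = 20)
print(round(r, 3))
```

Taking the absolute value reports the magnitude of the effect regardless of which algorithm is listed first.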

Solved – Kruskal Wallis or MANOVA

For 8 input variables and 8 outcome variables, you need multivariate multiple regression or MANCOVA.

MANOVA is used in case of one input and multiple outcomes.
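
For orientation, a multivariate multiple regression can be fitted in base R by binding the outcomes together on the left-hand side of `lm`. The sketch below uses two simulated predictors and two simulated outcomes instead of eight of each; all variable names and coefficients are assumptions for illustration.

```r
# Simulated data (an assumption): two predictors, two outcomes
set.seed(42)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y1 <- 1 + 0.5 * d$x1 + rnorm(n)
d$y2 <- -1 + 0.3 * d$x2 + rnorm(n)

# cbind() on the left-hand side fits all outcomes jointly
fit <- lm(cbind(y1, y2) ~ x1 + x2, data = d)

print(class(fit))        # a multivariate linear model ("mlm")
print(coef(fit))         # one column of coefficients per outcome
print(anova(fit))        # multivariate tests for each predictor
```

With eight outcomes and eight inputs, the same pattern extends by adding columns inside `cbind()` and terms on the right-hand side.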
