Solved – Negative H value in Kruskal-Wallis test

I've found exactly one source addressing this (and of course didn't save it). It said that in a Kruskal-Wallis test this is a consequence of having a large sample with a lot of ties. Seeing as I've got about 50,000 respondents and only an 11-point scale variable, I'd say I qualify for both.
What that source didn't say, however, is how to treat this anomaly. At first I just treated it as if $p > 0.05$, i.e. not significant. However, when I accidentally loaded the wrong data into my post hoc analysis (the data with the negative H), a lot of the pairwise comparisons turned out to be significant (even more so than in some of the tests where the H value had $p < 0.001$).
So that made me wonder whether I have to treat this negative H value differently. Should I just use a random subsample of the data to see whether that has a significant H value, or declare the test invalid and just see what happens with the post hoc tests? (The latter seems unlikely.)
By the way, my post hoc analysis consists of Bonferroni-corrected Mann-Whitney U comparisons.

Best Answer

The Kruskal-Wallis $H$ statistic is given by:

$$H=\frac{\frac{12\sum_{i=1}^{k}{n_{i}\left(\bar{R}_{i}-\bar{R}\right)^{2}}}{N\left(N+1\right)}}{1-\frac{\sum{T}}{N^{3}-N}}\text{, where:}$$

$k$ is the number of groups;
$N$ is the number of observations across all groups;
$n_{i}$ is the number of observations in the $i^{th}$ group;
$\bar{R}$ is the mean rank of all observations;
$\bar{R}_{i}$ is the mean rank of observations from the $i^{th}$ group (ranks are across observations from all groups); and
$T=t^{3}-t$ for each set of tied ranks, where $t$ is the number of tied observations in the set, and $\sum{T}$ is the sum of this quantity across all sets of tied ranks.

When there are no ties, every $T=0$ and the denominator of $H$ simplifies to $1$.
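
As a minimal sketch of the formula above in plain Python (the function names `midranks` and `kruskal_wallis_h` are illustrative and do not come from any library), tied values receive the average of the ranks they span, and $\bar{R}_{i}$ is taken to be the mean rank of group $i$:

```python
from collections import Counter
from itertools import chain


def midranks(values):
    """Map each distinct value to its average rank; also return tie counts."""
    counts = Counter(values)
    ranks, start = {}, 1
    for v in sorted(counts):
        t = counts[v]
        # The t tied observations occupy ranks start .. start + t - 1.
        ranks[v] = start + (t - 1) / 2
        start += t
    return ranks, counts


def kruskal_wallis_h(groups):
    """Return (H, tie-correction denominator) for a list of samples."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks, counts = midranks(pooled)
    mean_rank = (n_total + 1) / 2

    # Weighted squared deviations of group mean ranks from the overall mean.
    between = sum(
        len(g) * (sum(ranks[x] for x in g) / len(g) - mean_rank) ** 2
        for g in groups
    )
    numerator = 12 * between / (n_total * (n_total + 1))

    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N).
    tie_sum = sum(t**3 - t for t in counts.values())
    denominator = 1 - tie_sum / (n_total**3 - n_total)
    return numerator / denominator, denominator


groups = [[1, 1, 2], [2, 3, 3]]
h, correction = kruskal_wallis_h(groups)
print(h, correction)
```

Python integers have arbitrary precision, so the cubes in this sketch are computed exactly regardless of $N$. The function divides by zero if every observation is tied on one value, which matches the undefined case of the formula.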

For $N=50{,}000$ and a roughly uniform distribution of ties across your eleven possible values (about 4,545 observations per value), the denominator of $H$ is approximately:

$$1-\frac{11\left(4545^3-4545\right)}{50000^3-50000} \approx 0.9917$$

Assuming a highly skewed distribution of ties (say, all but ten observations tied on a single value, with the remaining ten untied), the denominator of $H$ is approximately:

$$1-\frac{49990^3-49990}{50000^3-50000} \approx 0.0006$$
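
The two scenarios above can be checked numerically. This is a small sketch in plain Python; the helper name `tie_correction` is made up for illustration, and the uniform case assumes eleven tied sets of 4,545 observations each:

```python
def tie_correction(tie_sizes, n_total):
    """Denominator of H: 1 - sum(t^3 - t) / (N^3 - N)."""
    tie_sum = sum(t**3 - t for t in tie_sizes)
    return 1 - tie_sum / (n_total**3 - n_total)


N = 50_000

# Eleven scale values with roughly equal numbers of respondents.
uniform = tie_correction([4545] * 11, N)

# All but ten observations tied on one value; the other ten are untied.
skewed = tie_correction([N - 10], N)

# Every observation tied on the same value.
degenerate = tie_correction([N], N)

print(f"uniform:    {uniform:.4f}")
print(f"skewed:     {skewed:.4f}")
print(f"degenerate: {degenerate:.4f}")
```

Untied observations are sets of size one and contribute $1^{3}-1=0$, so they can be left out of `tie_sizes`.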

The most extreme case would be where all $N$ observations were tied on the same value, in which case the denominator of $H$ would simplify to $0$, and $H$ would thus be undefined.

Because the tied sets are disjoint, $\sum{T}$ can never exceed $N^{3}-N$, so the denominator cannot be negative. The numerator is a sum of squares weighted by the positive $n_{i}$, so it cannot be negative either. I therefore do not think it is possible to obtain a negative value of $H$.
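
One way to make the bound on the denominator explicit, writing $f(t)=t^{3}-t$ as shorthand (notation introduced here, not taken from Kruskal and Wallis): for any two tied sets of sizes $a, b \geq 1$,

$$f(a+b)-f(a)-f(b)=3ab\left(a+b\right)\geq 0,$$

so merging tied sets never decreases $\sum{T}$. The set sizes sum to at most $N$, and $f$ is increasing for $t \geq 1$, which gives

$$0\leq\sum{T}\leq f(N)=N^{3}-N \quad\Longrightarrow\quad 0\leq 1-\frac{\sum{T}}{N^{3}-N}\leq 1.$$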

Conclusion:

  • It is not possible to obtain a negative value of $H$ by adjusting for ties using Kruskal and Wallis's formula for $H$ (Equation 1.2) and their adjustment for ties (Equation 1.3).
  • Cubing a large $N$ (here $50000^{3}=1.25\times10^{14}$) might force one's software to calculate beyond its available precision, and numerical inconsistencies might result.
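
As an illustration of the second point, and only as an assumption about what some software might do (nothing above says which program produced the negative $H$), the sketch below simulates signed 32-bit integer arithmetic. For $N=50{,}000$, $N^{3}$ exceeds the largest such value, 2,147,483,647, and wraps around. The helper `wrap_int32` is hypothetical:

```python
def wrap_int32(x):
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    return (x + 2**31) % 2**32 - 2**31


N = 50_000

# Exact value, as Python's arbitrary-precision integers compute it.
exact = N**3 - N

# The same expression if every intermediate result were stored in 32 bits.
wrapped = wrap_int32(wrap_int32(N**3) - N)

print("exact   N^3 - N:", exact)
print("wrapped N^3 - N:", wrapped)
```

In this simulation $N^{3}-N$ wraps to a negative number, so a tie correction computed from it would no longer be confined to the interval from 0 to 1. Double-precision floating point represents integers of this size exactly, so the problem would be specific to narrow integer types.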

Kruskal, W. H. and Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621.