Nemenyi Post-Hoc Test – How to Apply Correctly After Friedman Test

multiple-comparisons, nonparametric, post-hoc

I'm comparing the performance of multiple algorithms on multiple data sets. Since those performance measurements are not guaranteed to be normally distributed, I chose the Friedman Test with the Nemenyi post-hoc test based on Demšar (2006).

I then found another paper that, aside from suggesting other methods such as the Quade test with a subsequent Shaffer post-hoc test, applies the Nemenyi test differently.

How do I apply the Nemenyi post-hoc test correctly?

1. Using the Studentized range statistic?

Demšar's paper says to reject the null hypothesis (no performance difference between two algorithms) if the average rank difference is greater than the critical distance CD with
$$
CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}
$$

"where critical values qα are based on the Studentized range statistic divided by $\sqrt{2}.$"

After some digging I've found that those "critical values" can be looked up for certain alphas, for example in a table for $\alpha = 0.05$, using the row for infinite degrees of freedom (at the bottom of each table).
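For what it's worth, the critical value can also be computed numerically instead of read off a table. A minimal sketch, assuming SciPy (≥ 1.7, which provides `scipy.stats.studentized_range`) and made-up values for $k$, $N$ and $\alpha$; a very large `df` stands in for the infinite-degrees-of-freedom row:

```python
import numpy as np
from scipy.stats import studentized_range

k = 5          # number of algorithms (made-up example value)
N = 30         # number of data sets (made-up example value)
alpha = 0.05

# Critical value of the studentized range; a very large df approximates
# the "infinite degrees of freedom" row of the printed tables.
q = studentized_range.ppf(1 - alpha, k, 1e6)

# Demšar's q_alpha is the studentized range critical value divided by sqrt(2).
q_alpha = q / np.sqrt(2)

# Critical distance of the Nemenyi test.
CD = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
print(q_alpha, CD)
```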

2. or using the normal distribution?

Just when I thought I knew what to do, I found another paper that confused me again, because it only uses the normal distribution. Demšar states a similar thing on page 12:

The test statistics for comparing the i-th and j-th classifier using these methods is
$$
z = \frac{R_i - R_j}{\sqrt{\frac{k(k+1)}{6N}}}
$$

The z value is used to find the corresponding probability from the table of normal distribution, which is then compared with an appropriate $\alpha$. The tests differ in the way they adjust the value of $\alpha$ to compensate for multiple comparisons.

In that paragraph he is talking about comparing all algorithms to a control algorithm, but the remark "differ in the way they adjust … to compensate for multiple comparisons" suggests that this should also hold for the Nemenyi test.

So what seems logical to me is to calculate the p-value based on the test statistic $z$, which is normally distributed, and then correct for the $k(k-1)/2$ pairwise comparisons by dividing $\alpha$ by that number (equivalently, multiplying the p-value by it).
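A minimal sketch of my reading of that approach, with made-up average ranks and placeholder values for $k$, $N$ and $\alpha$, assuming SciPy's normal distribution:

```python
import numpy as np
from scipy.stats import norm

k = 5                  # number of algorithms (made-up example value)
N = 30                 # number of data sets (made-up example value)
alpha = 0.05
R_i, R_j = 2.1, 3.4    # hypothetical average ranks of two algorithms

# z statistic from Demšar, p. 12
z = (R_i - R_j) / np.sqrt(k * (k + 1) / (6 * N))

# two-sided p-value from the normal distribution
p = 2 * norm.sf(abs(z))

# Bonferroni-style adjustment over all k(k-1)/2 pairwise comparisons:
# compare p against alpha / m (equivalently, compare m * p against alpha)
m = k * (k - 1) / 2
print(z, p, p < alpha / m)
```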

However, that yields completely different rank differences at which to reject the null hypothesis. And now I'm stuck and don't know which method to apply. I'm strongly leaning towards the one using the normal distribution, because it is simpler and more logical to me. I also don't need to look up values in tables and I'm not bound to certain significance values.

Then again, I've never worked with the studentized range statistic and I don't understand it.

Best Answer

I also just started to look at this question.

As mentioned before, when we use the normal distribution to calculate p-values for each test, these p-values do not take multiple testing into account. To correct for it and control the family-wise error rate, we need some adjustment. Bonferroni, i.e. dividing the significance level or multiplying the raw p-values by the number of tests, is only one possible correction. There are a large number of other multiple testing p-value corrections that are in many cases less conservative.

These p-value corrections do not take the specific structure of the hypothesis tests into account.
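As a concrete illustration, here is a minimal sketch applying two such corrections to a hypothetical vector of raw pairwise p-values, assuming statsmodels' `multipletests`; Holm is one of the less conservative alternatives to plain Bonferroni:

```python
from statsmodels.stats.multitest import multipletests

# hypothetical raw p-values from the k(k-1)/2 pairwise z tests
raw_p = [0.001, 0.012, 0.020, 0.049, 0.300, 0.800]

for method in ("bonferroni", "holm"):
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(method, adj_p.round(3), reject)
```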

I am more familiar with the pairwise comparison of the original data instead of the rank-transformed data used in the Kruskal-Wallis or Friedman tests. In that case, the Tukey HSD test, the test statistic for the multiple comparisons follows the studentized range distribution, which is the distribution of all pairwise comparisons under the assumption of independent samples. It is based on probabilities of the multivariate normal distribution, which can be calculated by numerical integration but are usually taken from tables.
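For reference, that all-pairwise comparison on unranked data is available directly, e.g. via statsmodels' `pairwise_tukeyhsd`; a minimal sketch on made-up data:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# made-up measurements for three groups, purely for illustration
values = np.concatenate([rng.normal(loc, 1.0, 20) for loc in (0.0, 0.3, 1.0)])
groups = np.repeat(["A", "B", "C"], 20)

# all pairwise comparisons based on the studentized range (Tukey HSD)
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```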

My guess, since I don't know the theory, is that the studentized range distribution can be applied to the case of rank tests in a similar way as in the Tukey HSD pairwise comparisons.

So, (2) using the normal distribution plus multiple-testing p-value corrections and (1) using the studentized range distribution are two different ways of getting an approximate distribution of the test statistics. However, if the assumptions for the use of the studentized range distribution are satisfied, then it should provide a better approximation, since it is designed for the specific problem of all pairwise comparisons.
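One way to see how much the two approximations differ in practice is to compare the critical differences they imply; a minimal sketch, reusing the made-up $k$, $N$ and $\alpha$ from above and assuming SciPy:

```python
import numpy as np
from scipy.stats import norm, studentized_range

k, N, alpha = 5, 30, 0.05
se = np.sqrt(k * (k + 1) / (6 * N))

# (1) Nemenyi: studentized range divided by sqrt(2), large df ~ infinity
cd_nemenyi = studentized_range.ppf(1 - alpha, k, 1e6) / np.sqrt(2) * se

# (2) Normal approximation with Bonferroni correction over k(k-1)/2 pairs
m = k * (k - 1) / 2
cd_bonferroni = norm.ppf(1 - alpha / (2 * m)) * se

print(cd_nemenyi, cd_bonferroni)
```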
