I am doing a study on 3 drugs, comparing response pre- and post-treatment. My objective is to know whether these drugs are effective and which one is better. I used nonparametric tests since the results weren't normally distributed and weren't transformable. Two drugs were effective, with a significant difference on the Wilcoxon signed-rank test, and the third showed no significant difference. Yet when comparing the post-treatment results using the Kruskal-Wallis test, no significant difference was observed, and the pre-treatment results showed no significant differences between the groups on the Kruskal-Wallis test either. Why is that? Did I choose the wrong tests?
Solved – getting inconsistent results from Wilcoxon signed rank test and Kruskal-Wallis test
hypothesis-testing, kruskal-wallis-test, nonparametric, repeated-measures, wilcoxon-signed-rank
Related Solutions
This is not a serious or fatal flaw. Here's why.
In its most general form the Kruskal-Wallis test is a test for stochastic dominance among $k$ groups. That is, the null hypothesis is:
$H_0\colon \mathrm{P}\left(X_{i} > X_{j}\right) = 0.5$
for all groups $i$ and $j$ from $1$ to $k$, with the alternative hypothesis:
$H_{\text{A}}\colon \mathrm{P}\left(X_{i} > X_{j}\right) \ne 0.5$
for at least one group $i \ne j$. Putting the null hypothesis into plain language, no stochastic dominance would mean that a randomly drawn observation from group $i$ is just as likely to be larger than to be smaller than a randomly drawn observation from group $j$, for any two groups. Putting the alternative hypothesis into plain language, the existence of stochastic dominance would mean that a randomly drawn observation from at least one group $i$ is more likely to be larger than to be smaller than a randomly drawn observation from a different group $j$.
These hypotheses make no assumptions about the shape, width, or location of the groups' distributions, other than that, within each group, observations are independent and identically distributed (i.i.d.).
Sometimes researchers want to use these tests (and other nonparametric tests) as tests for median difference (instead of tests for stochastic dominance). However, in order to interpret test results this way one has to make the additional assumptions that the shapes and widths of the distributions of all groups are the same except for location. If you must provide inference about median difference and one of your groups has a differently shaped distribution than another, then yes, this test is fatally flawed. But stochastic dominance is often a good enough kind of inference to make.
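A quick simulation (with made-up data, assuming `numpy` and `scipy` are available) illustrates the distinction: two groups can share the same population median while the Kruskal-Wallis test still rejects, because the test is sensitive to stochastic dominance, not to median differences.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Two hypothetical groups with the same population median (0) but
# different shapes: a symmetric normal vs a right-skewed shifted exponential.
a = rng.normal(0.0, 1.0, 2000)
b = rng.exponential(1.0, 2000) - np.log(2)  # median of Exp(1) is ln 2

print("sample medians:", np.median(a), np.median(b))  # both near 0

# Kruskal-Wallis rejects anyway: P(B > A) != 0.5, i.e. one group
# stochastically dominates the other even though the medians agree.
stat, p = kruskal(a, b)
print("p =", p)
```

So a significant Kruskal-Wallis result, on its own, licenses a statement about stochastic dominance, not about medians, unless the equal-shape assumption holds.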
Wilcoxon signed-rank tests seem appropriate since they reflect the fact that measures are taken repeatedly on the same subjects (which increases the test's power to detect real effects), but are modest enough to recognize that grades on writing assignments are ordinal, not interval, variables (i.e. the difference between a B+ and an A could be much smaller than the difference between an A and an A+, etc.).
Doing the Wilcoxon signed-rank test assumes that, even if the steps between grades are not uniformly large, you as a grader would know, for any two changes you compare among the students, which of the two is bigger. If you cannot do that, you cannot rank the changes, and without a ranking there is no rank test. You would be limited to a sign test: basically just counting how many students improved, how many stayed put, and how many deteriorated. If there are many more improvements than deteriorations, your test will be significant. Such a test is obviously less powerful, since it has no notion of how large any of the improvements were. I do not think you need to resort to it: as long as you can establish a ranking of improvements, you don't.
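As a sketch of the difference between the two tests (the grade data here are made up, coded as ordinal ranks where higher is better; `scipy` is assumed to be available):

```python
import numpy as np
from scipy.stats import wilcoxon, binomtest

# Hypothetical before/after grades for 10 students, coded 1..5.
before = np.array([2, 3, 3, 4, 2, 3, 4, 2, 3, 3])
after  = np.array([3, 4, 4, 5, 3, 5, 4, 3, 4, 4])
diff = after - before

# Wilcoxon signed-rank test: uses the *ranking* of the magnitudes of
# change; zero differences are dropped by default.
w = wilcoxon(after, before)

# Sign test: only counts improvements vs deteriorations, ignoring size.
n_up, n_down = int((diff > 0).sum()), int((diff < 0).sum())
s = binomtest(n_up, n_up + n_down, 0.5)

print("signed-rank p =", w.pvalue, " sign-test p =", s.pvalue)
```

With many ties among the grade differences, scipy will fall back to a normal approximation for the signed-rank test (and may warn about it), but the comparison still shows the point: both tests detect the improvement here, and the sign test is the one you can always fall back on when only the direction of change is trustworthy.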
If, on the other hand, your literacy score is measured on a more objective, countable scale, for example the number of mistakes per 100 words (I'm no expert in the field, but you see what I mean by objective, I believe), then you can even use paired t-tests. They will have higher statistical power to detect real effects.
When used in the right conditions as described above, the power of the tests to detect existing changes compares as follows:
$$\text{t-test} \geq \text{Wilcoxon signed-rank test} \geq \text{sign test}$$
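A small Monte Carlo sketch of that ordering (simulated, roughly normal paired data; the effect size, sample size, and repetition count are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon, binomtest

rng = np.random.default_rng(42)
n, reps, alpha, effect = 30, 500, 0.05, 0.4

# Count how often each test detects a true improvement of size `effect`.
hits = {"t": 0, "wilcoxon": 0, "sign": 0}
for _ in range(reps):
    before = rng.normal(0, 1, n)
    after = before + rng.normal(effect, 1, n)  # genuine improvement
    d = after - before
    if ttest_rel(after, before).pvalue < alpha:
        hits["t"] += 1
    if wilcoxon(after, before).pvalue < alpha:
        hits["wilcoxon"] += 1
    if binomtest(int((d > 0).sum()), n, 0.5).pvalue < alpha:
        hits["sign"] += 1

for name, h in hits.items():
    print(name, "power ~", h / reps)
```

On data like this the t-test and the signed-rank test come out close (the signed-rank test loses very little power on normal data), while the sign test trails clearly behind, matching the ordering above.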
In any of the three options, use two-sided tests, since the possibility that a training session deteriorated performance is real. That is what everybody does: doctors also hope their medication works better than a placebo, but they use two-sided tests because it might be even worse than doing nothing. (One-sided tests would just make your $\alpha$ level less stringent and are frowned upon.)
Just to be sure: do these 3 school classes only exist so that you can get a big enough sample size? You have not chosen the three classes to purposefully represent, for example, one posh private school, one average school, and one underprivileged school? If you have, you will need a more complex statistical methodology to include that information in your analysis as well.
Now to the most important part: The preceding caveats and options notwithstanding, you still need to control your p-value cutoffs for multiple testing. It is very important not to confuse two concepts here:
- your tests are paired on the student level since you observe the same student multiple times (as opposed to observing a different class of 90 students each time after one of your training intervals)
- your tests are pairwise since you compare multiple intermediary situations (as opposed to only comparing the before-after states)
Being paired is taken care of by signed rank tests (or paired t-tests or sign tests), being pairwise requires the following additional precautions:
When the null hypothesis is true and there is no real effect, you still have a chance of a false discovery proportional to the $\alpha$ cutoff that you compare your p-value to. That is true for every test, so if you do enough tests, you are bound to find some significant results that are false. You need to correct for this inflated chance of false discovery. The easiest way is the Bonferroni correction: just divide your $\alpha$ level by the number of tests. For example, you would compare all your p-values against a cutoff of $\alpha/9 \approx 0.56\%$, because you perform 9 tests, instead of the usual $5\%$ cutoff for a single test. You can see that this is quite a stringent restriction. Holm's method is a little less stringent while still not inflating the chance of false discovery; it is preferable for that reason.
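With hypothetical p-values standing in for your 9 comparisons, both corrections can be sketched in a few lines of plain Python (no special library needed; `statsmodels.stats.multitest.multipletests` does the same thing in practice):

```python
# Hypothetical p-values from 9 pairwise comparisons.
pvals = [0.001, 0.004, 0.006, 0.020, 0.030, 0.041, 0.048, 0.090, 0.200]
alpha = 0.05
m = len(pvals)

# Bonferroni: compare every p-value to alpha / m.
bonf = [p < alpha / m for p in pvals]

# Holm: compare the k-th smallest p-value to alpha / (m - k),
# stopping at the first failure (all later hypotheses are retained).
order = sorted(range(m), key=lambda i: pvals[i])
holm = [False] * m
for k, i in enumerate(order):
    if pvals[i] < alpha / (m - k):
        holm[i] = True
    else:
        break

print(sum(bonf), "rejections with Bonferroni,", sum(holm), "with Holm")
# → 2 rejections with Bonferroni, 3 with Holm
```

Note the p-value of $0.006$: Bonferroni retains it ($0.006 > 0.05/9 \approx 0.0056$) while Holm rejects it ($0.006 < 0.05/7 \approx 0.0071$), which is exactly the sense in which Holm is uniformly more powerful at the same family-wise error rate.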
Practically speaking, are you sure your results are actionable? If you find out for example that the first two training intervals didn't help, but the third and fourth did help, then the fifth and sixth did active harm, the seventh was neutral again and the last two helped, can you translate such mixed results into actionable recommendations? Recommending to skip the intervals 5 and 6 and to put more emphasis on intervals 3 and 4 is only actionable if each interval did something different with the students. If you can explain based on some theory (not only the statistical data) why it could be that some training sessions helped but others didn't, that's insightful. If all the intervals were supposed to do more of the same but didn't, this will be hard to make sense of.
Also, even if you do the 9 intermediary comparisons, you can still do a before-after comparison as well. You just need to adjust your cutoffs for one more test (10 instead of 9).
Best Answer
What you should always keep in mind is that "The Difference Between 'Significant' and 'Not Significant' is not Itself Statistically Significant" — there is a nice paper with this title by Gelman and Stern, but the idea is very simple.
In your case, when you conduct three separate Wilcoxon tests, you might get p-values of, say, $0.045$, $0.045$, and $0.055$. The first two are "significant" according to the common $p<0.05$ criterion, and the third one is not. However, the differences between the p-values are tiny, so it is entirely possible that when you compare the three groups with each other, you will fail to find any significant difference. Which seems to be exactly your case.
In addition: doing Kruskal-Wallis on the pre and post measures separately is probably not the best approach. You can subtract pre from post and do one Kruskal-Wallis on these differences. It is of course still possible (as I explain above) that you will not get a significant difference, but this is a more correct approach.
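In code (simulated data, just to show the mechanics with `scipy`; the effect sizes are invented): subtract pre from post per patient, then run one Kruskal-Wallis test on the three sets of differences.

```python
import numpy as np
from scipy.stats import kruskal, wilcoxon

rng = np.random.default_rng(1)

# Hypothetical pre/post scores for three drugs, 20 patients each;
# drugs A and B improve scores by ~5 points, drug C does nothing.
pre = {d: rng.normal(50, 10, 20) for d in "ABC"}
post = {"A": pre["A"] + rng.normal(5, 8, 20),
        "B": pre["B"] + rng.normal(5, 8, 20),
        "C": pre["C"] + rng.normal(0, 8, 20)}

# Three separate within-drug pre-post tests (what the question describes).
for d in "ABC":
    print(d, "pre-post Wilcoxon p =", wilcoxon(post[d], pre[d]).pvalue)

# Better between-drug comparison: one Kruskal-Wallis on the *changes*,
# rather than on pre and post scores separately.
diffs = [post[d] - pre[d] for d in "ABC"]
stat, p = kruskal(*diffs)
print("Kruskal-Wallis on pre-post differences: p =", p)
```

Even with this setup, the between-drug test can come out non-significant while some within-drug tests are significant: the within-drug tests ask "did this drug change scores at all?", while the Kruskal-Wallis on differences asks the harder question "do the drugs differ from each other?".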
Just to stress it again: if one drug comes out with a significant pre-post difference and another with an insignificant one, that is (by itself) no reason whatsoever to believe that one drug is better than the other. Unfortunately, this is a very widespread mistake.