ANOVA vs Kruskal-Wallis: Differences and When to Use Each Test

Tags: anova, kruskal-wallis test, nonparametric

I understand that ANOVA is a parametric test while Kruskal Wallis is a non-parametric test. My dataset is not normally distributed but heavily skewed (Skewness: 2.5, Kurtosis: 26.1).

Because of that, I would usually perform a Kruskal Wallis test. However, based on the Central Limit Theorem and the size of my dataset (n=5000), I understand that I could also do the ANOVA test. I tried both and my results are as follows:

ANOVA: F = 1.42 | p = 0.23
Kruskal Wallis: H = 15.82 | p = 0.0012

Questions:

  1. Is my understanding described above correct or am I missing something?
  2. I assume the difference comes from the ranking approach of the non-parametric test, but how do I decide which test to use here?

Best Answer

I understand that ANOVA is a parametric test while Kruskal Wallis is a non-parametric test.

True, in the sense that the null distribution of the test statistic in ANOVA is derived under a specific finite-parametric distributional assumption (in this case, of normality within-groups with constant variance across groups for the usual ANOVA), while the Kruskal-Wallis does not make a specific distributional assumption.

My dataset is not normally distributed but heavily skewed (Skewness: 2.5, Kurtosis: 26.1).

It's pretty kurtotic, but in my book 2.5 is not quite into the 'very heavily skewed' region (I tend to place that above 3.5 or so). I see a lot of skewed data, though, so my sense of just how skewed things can easily get may differ from yours, simply because we would typically encounter different sorts of data.

Because of that, I would usually perform a Kruskal Wallis test.

I encourage you to attempt to broaden your set of considered parametric distributional models. (I have nothing against the Kruskal-Wallis, but if you know things about your distribution, you can probably take advantage of that knowledge.)

The main reason to choose a Kruskal-Wallis test over an ANOVA would be to maintain your chosen significance level under the assumption that the distributions would be the same under $H_0$. However, it doesn't necessarily offer as much on the power side under fairly skewed distributions; it does reasonably well on symmetric, heavy-tailed distributions and pure-shift alternatives though.
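To illustrate the point about symmetric, heavy-tailed distributions with pure-shift alternatives, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not something from the question: three groups, $t_3$ errors, 60 observations per group, the particular shifts, and the helper name `estimate_power`. It only covers the symmetric heavy-tailed case, not skewed data.

```python
import numpy as np
from scipy import stats


def estimate_power(shifts, n_per_group=60, n_sims=300, alpha=0.05, seed=0):
    """Estimate rejection rates of one-way ANOVA and Kruskal-Wallis.

    Each group is drawn from a t distribution with 3 degrees of freedom
    (symmetric, heavy-tailed) and shifted by the matching entry of `shifts`.
    All settings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    reject_anova = 0
    reject_kw = 0
    for _ in range(n_sims):
        groups = [rng.standard_t(df=3, size=n_per_group) + s for s in shifts]
        reject_anova += stats.f_oneway(*groups).pvalue < alpha
        reject_kw += stats.kruskal(*groups).pvalue < alpha
    return reject_anova / n_sims, reject_kw / n_sims


# Pure-shift alternative: same shape in every group, different locations.
power_anova, power_kw = estimate_power(shifts=(0.0, 0.4, 0.8))
print(f"ANOVA power: {power_anova:.2f}, Kruskal-Wallis power: {power_kw:.2f}")
```

Passing `shifts=(0.0, 0.0, 0.0)` instead gives the rejection rates under $H_0$, which is a quick way to check that both tests hold roughly the nominal level in this setup.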

However, based on the Central Limit Theorem and the size of my dataset (n=5000),

That's a large sample size, but even at that sample size the central limit theorem doesn't necessarily imply that the ANOVA statistic will have very close to an F-distribution under $H_0$.

What you're relying on is two different things:

  1. That the distribution of sample means in each of the samples is close to normal at the sample size in each group.

  2. That the distribution of the F-statistic converges to the distribution of the numerator divided by the fixed value that the denominator should approach (i.e. $\sigma^2$).
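
Written out, as a sketch in notation that is assumed here for illustration ($k$ groups, group sizes $n_i$, total size $N = \sum_i n_i$, observations $Y_{ij}$, group means $\bar Y_i$, grand mean $\bar Y$), the statistic is

$$F = \frac{\frac{1}{k-1}\sum_{i=1}^{k} n_i\,(\bar Y_i - \bar Y)^2}{\frac{1}{N-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)^2}.$$

Under $H_0$ with a finite common variance $\sigma^2$, the denominator converges in probability to $\sigma^2$ (item 2), and if the group means are close to normal (item 1), then $\frac{1}{\sigma^2}\sum_{i} n_i(\bar Y_i - \bar Y)^2$ is approximately $\chi^2_{k-1}$. Combining the two,

$$(k-1)\,F \;\approx\; \frac{1}{\sigma^2}\sum_{i=1}^{k} n_i\,(\bar Y_i - \bar Y)^2 \;\text{ is approximately } \chi^2_{k-1},$$

which is also the large-$N$ limit of $(k-1)$ times an $F_{k-1,\,N-k}$ variable.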

In this particular case, given the information we have, I suspect it will be close enough for most purposes (though I have seen cases where it would not).

However, that really only gets you to the test having about the right significance level (that is, its rejection rate under $H_0$ should be about $\alpha$). It does not guarantee that the power will be good relative to what could be attained with a considered choice of distributional model.
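One way to check the significance-level claim for a given situation is to simulate under $H_0$. The sketch below is only that, a sketch under stated assumptions: lognormal data with `sigma=0.65` (skewness of roughly 2.5, but not matched to the kurtosis in the question), four equal groups of 1250, and the hypothetical helper name `anova_size_under_h0`. None of these come from the question's actual data.

```python
import numpy as np
from scipy import stats


def anova_size_under_h0(n_groups=4, n_per_group=1250, sigma=0.65,
                        n_sims=400, alpha=0.05, seed=1):
    """Estimate the rejection rate of one-way ANOVA when H0 is true.

    All groups are drawn from the same lognormal distribution, so every
    rejection is a type I error. The distribution and group layout are
    illustrative assumptions, not the asker's data.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # One row per group, all rows from the identical skewed distribution.
        data = rng.lognormal(mean=0.0, sigma=sigma, size=(n_groups, n_per_group))
        rejections += stats.f_oneway(*data).pvalue < alpha
    return rejections / n_sims


rate = anova_size_under_h0()
print(f"Estimated type I error rate at alpha = 0.05: {rate:.3f}")
```

An estimate near 0.05 says only that the level is about right for this assumed distribution; it says nothing about power, which needs a separate simulation under an alternative.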

how do I decide which test to use here?

Never calculate two tests and then try to choose one. Even if you somehow manage to avoid engaging in p-hacking, you'll still look like you are - unless your habit is always to choose the higher p-value.

Choosing what hypothesis you test based on what you find in your sample (as you appear to be doing here) is testing hypotheses suggested by the data, which is problematic.

If you're as concerned about maintaining type I error rates as you appear to be, this practice should concern you a great deal, since it will impact your type I error rate (specifically, it will typically tend to inflate it, but the extent is difficult to quantify because it relies on people's perceptions).

Set up your testing protocol before you gather data, not after. This will require you to get a good idea of how your variables will tend to behave before gathering your sample. Choosing good distributional models, so that you have good power, may be quite involved, but a lot of people pay it hardly any mind at all, especially not at the planning stage.

If this is simply impossible for some reason, draw a larger sample and split it into two parts, one for model identification and one for testing (with the testing part being sequestered from view until after the protocol for it has been determined).
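
A minimal sketch of such a split, assuming the observations sit in a flat array and a simple random (unstratified) 50/50 split is acceptable. The helper name `split_for_protocol` and the synthetic data are hypothetical, for illustration only.

```python
import numpy as np


def split_for_protocol(n_obs, frac_identify=0.5, seed=0):
    """Randomly split observation indices into two disjoint parts.

    Returns (identify_idx, test_idx). The first part is for exploring the
    data and choosing a model or test; the second is held back and only
    analysed once the protocol has been fixed.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_obs)
    n_identify = int(round(frac_identify * n_obs))
    return shuffled[:n_identify], shuffled[n_identify:]


# Synthetic stand-in data: 5000 skewed values in 4 groups (assumed layout).
rng = np.random.default_rng(42)
values = rng.lognormal(size=5000)
groups = rng.integers(0, 4, size=5000)

identify_idx, test_idx = split_for_protocol(len(values))
explore_values, explore_groups = values[identify_idx], groups[identify_idx]
# values[test_idx] and groups[test_idx] stay untouched until the test is chosen.
print(len(identify_idx), len(test_idx))
```

Fixing the seed and recording it alongside the protocol makes the split reproducible, so it can be shown afterwards that the testing half was not used for model identification.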