Solved – Analysis of variance for nonnormal data with unequal variance

anova | descriptive statistics | heteroscedasticity | kruskal-wallis test

I would like to ask whether it is possible to perform an analysis of variance on data that is not normally distributed and has unequal variances, given that my sample size is large.

I have read that we can perform an ANOVA when the data are normally distributed with equal variances. It is also said that the assumptions do not necessarily need to be met if the sample size is large enough (is this statement true for both assumptions, equal variances and normality?).

An alternative to ANOVA might be Welch's ANOVA (for unequal variances), but reportedly it still requires normality. Unfortunately, I cannot find whether the normality assumption can be violated for Welch's ANOVA if the sample size is large enough.

Another alternative might be the Kruskal–Wallis H test, since it does not require normally distributed data, but some articles say that 'roughly' equal variances between groups must be met.

The problem is that I am not sure what 'roughly' means exactly. In my case the values are whole numbers from the interval [-6, 6]. The maximal difference between group standard deviations is 1, which I think is not large since the range of possible values is 12. If I perform, for example, Levene's test for equality of variances, it gives a p-value below 0.05, which means the data have unequal variances? But can I ignore the result of the test, since variance equality only needs to be 'roughly' met?
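(For reference, a base-R version of Levene's test, in its Brown–Forsythe variant that uses absolute deviations from group medians, can be computed without extra packages. The data below are made up purely for illustration.)

```r
# Brown-Forsythe version of Levene's test in base R:
# a one-way ANOVA on absolute deviations from each group's median.
# (Illustrative data: integer scores in [-6, 6], two groups.)
set.seed(1)
scores <- c(sample(-3:3, 30, replace = TRUE),   # narrow group
            sample(-6:6, 30, replace = TRUE))   # wide group
groups <- factor(rep(c("A", "B"), each = 30))
dev <- abs(scores - ave(scores, groups, FUN = median))
anova(lm(dev ~ groups))   # the F test's p-value tests equality of group variances
```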

To conclude, I would like to know which test I can use if I have a large enough sample with a non-normal distribution and unequal variances (can I use the tests mentioned above, or is there another alternative for my scenario)?

Best Answer

Generally speaking, a one-way ANOVA is reasonably robust against non-normality as long as skewness is slight and there are no far outliers. If your observations are integers between $\pm 6,$ there is no chance for far outliers, and I suppose group means of moderate-sized samples will be nearly normal.
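As a quick illustration of that last point (simulated scores, not your data), means of samples of $n = 20$ such bounded integer values are already very close to normal:

```r
# Simulate 10000 sample means of n = 20 integer scores in [-6, 6]
# (binomial scores shifted by -6, as in the example further below).
set.seed(2020)
means <- replicate(10000, mean(rbinom(20, 12, 0.3) - 6))
mean(means)   # near the population mean 12*0.3 - 6 = -2.4
sd(means)     # near sqrt(12*0.3*0.7)/sqrt(20), about 0.355
qqnorm(means); qqline(means)   # Q-Q plot is nearly a straight line
```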

However, inequality of variances can easily give misleading results in a one-way ANOVA. So I think it is especially worthwhile to protect against effects of heteroskedasticity.

I suggest you use the version of a one-way ANOVA implemented in the oneway.test procedure in R. This ANOVA does not assume equal variances.

Here is an example with simulated data for 4 levels of the factor (groups) and $r = 20$ replications per level. Of course, my simulated data may not imitate your data well, but you can see how oneway.test works.

    set.seed(2020)
    n = 20;  k = 4
    x1 = rbinom(n, 12, .3) -6
    x2 = rbinom(n, 12, .35)-6
    x3 = rbinom(n, 12, .4) -6
    x4 = rbinom(n, 12, .4) -6
    x = c(x1, x2, x3, x4)
    g = as.factor(rep(1:k, each=n))

    var(x1); var(x2); var(x3); var(x4)
    [1] 2.042105
    [1] 4.642105
    [1] 3.628947
    [1] 2.515789

    boxplot(x ~ g, col="skyblue2", pch=20, horizontal=T)

[Boxplots of the four simulated groups]

    stripchart(x ~ g, pch=20, method="stack")

[Stacked stripcharts of the four simulated groups]

    oneway.test(x ~ g)

           One-way analysis of means (not assuming equal variances)

    data:  x and g
    F = 4.4883, num df = 3.000, denom df = 41.779, p-value = 0.008076

There are significant differences among the group means. Still avoiding the assumption of equal variances, you can use Welch two-sample t tests for post hoc comparisons, using a Bonferroni (or some other) adjustment to protect against false discovery.

There is a significant difference between Groups 1 and 3:

    t.test(x1, x3)

            Welch Two Sample t-test

    data:  x1 and x3
    t = -3.0986, df = 35.241, p-value = 0.003806
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -2.7307616 -0.5692384
    sample estimates:
    mean of x mean of y 
        -2.60     -0.95 

But there is no significant difference between Groups 3 and 4 (not surprising, because they were simulated from the same distribution).

    t.test(x3,x4)$p.val
    [1] 0.7881982
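All six pairwise Welch comparisons can also be obtained in one call with `pairwise.t.test` (where `pool.sd = FALSE` gives Welch tests). The block below regenerates the same simulated data so it runs on its own:

```r
# Same simulated data as above (same seed), then all pairwise
# Welch t tests with a Bonferroni adjustment.
set.seed(2020)
n = 20;  k = 4
x1 = rbinom(n, 12, .3) -6
x2 = rbinom(n, 12, .35)-6
x3 = rbinom(n, 12, .4) -6
x4 = rbinom(n, 12, .4) -6
x = c(x1, x2, x3, x4)
g = as.factor(rep(1:k, each=n))
pairwise.t.test(x, g, pool.sd = FALSE, p.adjust.method = "bonferroni")
```

With six comparisons, the Groups 1 vs 3 p-value of 0.0038 found above remains below 0.05 after the Bonferroni adjustment ($0.0038 \times 6 \approx 0.023$).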