Solved – Find the minimum sample size to test for a normal distribution and 1-way ANOVA

Tags: anova, chi-squared-test, python, sample-size, t-test

I have been heuristically choosing sample sizes and would like to find a more quantitative method.

For each group of my data I want to verify that they are normally distributed. Then I want to check which groups are different. How should I find a sample size for both tests? And is my current methodology sound?

Methods

I'm generating between 2 and several groups of simulation data. Each group has output from a different model, and each value in a group is a floating point number coming from a single simulation. I want to run my simulations until all groups are normally distributed or some threshold for max iterations is reached. This means all groups have the same size.

Usually all the data in each group is within 1% of each other, so I've been using the magic number 30 for my sample size. I'm testing that each group is normally distributed with a chi-squared test. My expected values for chi-square are just the group average.

Finally, I check that all groups are the same with a 1-way ANOVA test. If a group is dissimilar, I use a t-test to find unique groupings.

Best Answer

Sample size for ANOVA. If you know the number of groups, the common variance of the group populations, the size of the difference in means you want to detect, the significance level of the test, and the power with which you want to detect that difference, then you can find the number of replications needed in each group. Most statistical software packages have a 'power and sample size' procedure for making such determinations.

In an actual situation you will have to guess some of these numbers, while others depend on how much risk you are prepared to take. Results from similar past experiments can serve as a guide, as can the budget for the study being planned.

Here is output from Minitab's power and sample size procedure for 90% power, testing at the 5% level, to detect a difference as small as 2 units between the largest and smallest of 4 group means when each group has standard deviation 1. The result is that $n = 9$ observations are needed in each of the four groups.

Power and Sample Size 

One-way ANOVA

α = 0.05  Assumed standard deviation = 1

Factors: 1  Number of levels: 4


   Maximum  Sample  Target
Difference    Size   Power  Actual Power
         2       9     0.9      0.932577

The sample size is for each level.
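Since the question is tagged Python, here is a minimal sketch of the same kind of calculation using SciPy's noncentral F distribution. It assumes the least favourable arrangement of means for a given maximum difference (two means `max_diff` apart, the remaining ones at the midpoint), which I believe is the convention Minitab uses. The function names `anova_power` and `min_group_size` are made up for this illustration. Under that assumption the search should land on the same 9 observations per group.

```python
from scipy import stats


def anova_power(n, k, max_diff, sigma, alpha=0.05):
    """Power of the one-way ANOVA F test with n observations in each of k groups.

    Assumption: two group means are max_diff apart and the other k - 2 means
    sit at the midpoint, so the sum of squared deviations of the means from
    their grand mean is max_diff**2 / 2.
    """
    df1, df2 = k - 1, k * (n - 1)
    ncp = n * (max_diff ** 2 / 2) / sigma ** 2   # noncentrality parameter
    f_crit = stats.f.ppf(1 - alpha, df1, df2)    # rejection threshold under H0
    return stats.ncf.sf(f_crit, df1, df2, ncp)   # P(reject H0 | alternative)


def min_group_size(k, max_diff, sigma, target=0.9, alpha=0.05, n_max=10_000):
    """Smallest per-group n whose power reaches the target."""
    for n in range(2, n_max + 1):
        if anova_power(n, k, max_diff, sigma, alpha) >= target:
            return n
    raise ValueError("target power not reached within n_max")


n = min_group_size(k=4, max_diff=2, sigma=1)
print("n per group:", n, " power:", anova_power(n, k=4, max_diff=2, sigma=1))
```

For your simulations, `sigma` and `max_diff` would have to come from pilot runs or from what size of difference between models you care about.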


Testing for normality. Finding the sample size necessary to judge normality is much more difficult. In the real world, nothing is exactly normally distributed, so you're probably testing whether the data are nearly enough normal that it is OK to do a standard ANOVA. That is more often a matter of experience and judgment than a matter of probability computation.

A test of normality needs a reasonable number of observations before it has useful power. On the other hand, for very large sample sizes such tests can be 'too fussy', rejecting the null hypothesis of normality for data that one would judge near enough to normal to be OK. For example, rounded normal data are not exactly normal, and in the real world all data must be rounded to some number of decimal places.

In the R session below, we sample $n = 1000$ observations from $\mathsf{Norm}(\mu=50, \sigma=3)$ and round them to integers. The Shapiro-Wilk test of normality is one of the best and most commonly used. It rejects normality for this sample, even though the data were exactly normal before rounding. This is not even a close call; the P-value is nearly $0.$ It is not as if we have 'rounded away' the essence of the data: there are 20 distinct integer values among the rounded data. (Even so, rounding to one or two decimal places might have been a better choice.)

set.seed(1015)                  # for reproducibility
x = round(rnorm(1000, 50, 3))   # normal sample, rounded to integers
shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.99024, p-value = 3.406e-06
length(unique(x))               # number of distinct rounded values
[1] 20

A normal probability plot of the rounded data looks nicely linear (except for the usual wobbles in the far tails) and their histogram is well-matched by the normal density function of the population from which the data were sampled.

[Figure: normal probability plot and histogram of the rounded data, with the population normal density overlaid]
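To see how much the rounding precision matters, one could repeat the experiment in Python at several precisions. This is only a sketch: the seed is arbitrary, NumPy's generator will not reproduce the R sample above, and the helper name `shapiro_after_rounding` is invented here. It prints, for each precision, the number of distinct values and the Shapiro-Wilk P-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)                 # arbitrary seed (an assumption)
raw = rng.normal(loc=50, scale=3, size=1000)      # unrounded normal sample


def shapiro_after_rounding(values, decimals):
    """Round to the given number of decimals, then run Shapiro-Wilk."""
    rounded = np.round(values, decimals)
    _, p_value = stats.shapiro(rounded)
    return len(np.unique(rounded)), p_value


# 0 decimals = integers, as in the R session; 1 and 2 decimals for comparison
results = {d: shapiro_after_rounding(raw, d) for d in (0, 1, 2)}
for decimals, (n_distinct, p_value) in results.items():
    print(f"decimals={decimals}  distinct values={n_distinct}  p-value={p_value:.3g}")
```

Running this over many seeds, rather than one, would give a fairer picture of how often each level of rounding triggers a rejection.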

By contrast, tests of normality have notoriously low power to detect non-normality in small samples. In the R session below, samples of size 10 from a uniform population (which lacks tails) and from a gamma distribution with shape 10 (which is right-skewed, with skewness about 0.63) both 'pass' the Shapiro-Wilk test at the 5% level:

set.seed(1492)
x = runif(10, 2, 10)        # uniform on (2, 10)
shapiro.test(x)$p.value
[1] 0.2535185

x = rgamma(10, 10, .2)      # gamma with shape 10, rate 0.2
shapiro.test(x)$p.value
[1] 0.6517063
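A single sample proves little either way, so a rough way to quantify this is to estimate the rejection rate of the Shapiro-Wilk test by simulation at several sample sizes. The Python sketch below does that for the same three populations used in this answer. The function name `rejection_rate`, the number of replications, and the seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats


def rejection_rate(sampler, n, reps=1000, alpha=0.05, seed=0):
    """Fraction of simulated samples of size n for which Shapiro-Wilk rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        _, p_value = stats.shapiro(sampler(rng, n))
        if p_value < alpha:
            rejections += 1
    return rejections / reps


samplers = {
    "uniform": lambda rng, n: rng.uniform(2, 10, n),
    "gamma": lambda rng, n: rng.gamma(10, 1 / 0.2, n),   # shape 10, rate 0.2
    "normal": lambda rng, n: rng.normal(50, 3, n),       # should reject about 5%
}

for name, sampler in samplers.items():
    for n in (10, 30, 100):
        print(f"{name:8s} n={n:3d}  rejection rate={rejection_rate(sampler, n):.3f}")
```

For the non-normal populations the rejection rate is the power of the test; for the normal population it should stay near the nominal 5% level. The same loop could be run with your own simulation output in place of the samplers to see what group size gives the test a fair chance.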

All 'rules-of-thumb' are wrong some of the time. But maybe one could say that the 'Goldilocks zone' of normality testing is from a few dozen to a few hundred observations, depending on the reason for testing normality.
