For small sample sizes, use Fisher's exact test: the sampling distribution of the $\chi^2$ test statistic is only approximately $\chi^2$, and this approximation is unreliable for small sample sizes.
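For instance, in R (the table counts here are a made-up illustration, not data from any study):

    tab <- matrix(c(3, 7, 8, 2), nrow = 2)   # hypothetical 2x2 table of counts
    chisq.test(tab)    # warns that the chi-squared approximation may be incorrect
    fisher.test(tab)   # exact P-value; no large-sample approximation involved

With expected cell counts below 5, chisq.test() itself flags the approximation as questionable, which is exactly the situation where Fisher's exact test is preferable.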
While a smaller sample size decreases the power of the test, the p-value (not the sample size) is the indicator of statistical significance. A significant p-value stays significant whatever the sample size; the sample size has already been accounted for in the calculation of the test statistic.
However, someone might claim that a small sample is more likely to be biased. This is not necessarily true, but I suspect there is a correlation between a study's sample size and whether the data were collected in an unbiased way.
Described below are three approaches to estimating sample size for completely randomized designs. Note that the procedures differ in terms of the information you must provide.
Approach #1 (requires the most information)
To calculate sample size, the researcher first needs to specify:
1) level of significance, α (alpha)
2) power, 1-β
3) size of the population variance, σ²
4) sum of the squared population treatment effects.
In practice, 3 and 4 are unknown. However, you can estimate both from a pilot study. Alternatively, you might estimate these parameters from previous research.
As an example, let's assume we conducted a pilot study and estimated the population variance and the sum of squared population treatment effects. If we let α = 0.05 and 1 − β = 0.80, then we can use trial and error to find the required sample size. The quantity to compute is phi (Φ), where:
Φ = √[n × (average of the squared treatment effects) / σ²], and n is the candidate sample size per treatment. The value of Φ can then be used to look up the power corresponding to that sample size in Tang's charts (citation below); you increase n until the power reaches 1 − β. A sketch of this loop appears below.
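Here is a minimal trial-and-error sketch in R, using the noncentral F distribution in place of a chart lookup. The pilot estimates below are hypothetical placeholders, not values from any real study:

    sig2    <- 4                  # hypothetical pilot estimate of the population variance
    effects <- c(-1, 0.5, 0.5)    # hypothetical treatment effects (they sum to zero)
    p       <- length(effects)    # number of treatment levels
    for (n in 2:50) {
      Phi    <- sqrt(n * mean(effects^2) / sig2)   # Phi as defined above
      lambda <- p * Phi^2                          # noncentrality: n * sum(effects^2) / sig2
      power  <- 1 - pf(qf(0.95, p - 1, p * (n - 1)),     # 0.95 = 1 - alpha
                       p - 1, p * (n - 1), ncp = lambda)
      if (power >= 0.80) { cat("n per treatment:", n, "power:", round(power, 3), "\n"); break }
    }

The loop mirrors the chart procedure: compute Φ for a trial n, find the corresponding power, and increase n until the power reaches 1 − β.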
Approach #2
If accurate estimates of #3 and #4 are not available from a pilot study or previous research, then you can use an alternative approach that requires only a general idea of the size of the difference between the largest and smallest population means relative to the population standard deviation:
μmax − μmin = dσ, where d is a multiple of the population standard deviation. In other words, this approach lets you calculate the sample size needed to detect a difference between the largest and smallest means equal to some multiple of the population standard deviation (one half, 1.5, or anything else). For the math behind this approach, see Kirk (2013) [I have a PDF]; a rough sketch follows.
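Without reproducing Kirk's derivation, here is a hedged sketch of the idea, assuming the least favorable configuration (the two extreme means dσ apart and all remaining means at the grand mean), for which the sum of squared treatment effects reduces to d²σ²/2:

    d <- 1.5; p <- 4    # hypothetical: detect a spread of 1.5 standard deviations, 4 groups
    for (n in 2:50) {
      lambda <- n * d^2 / 2       # noncentrality under the least favorable configuration
      power  <- 1 - pf(qf(0.95, p - 1, p * (n - 1)),     # alpha = 0.05
                       p - 1, p * (n - 1), ncp = lambda)
      if (power >= 0.80) { cat("n per treatment:", n, "\n"); break }
    }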
Approach #3
If you know nothing about #3 and #4 from Approach #1, and are unable to express μmax − μmin as a multiple of the population standard deviation, then you can use the strength of association or the effect size to calculate the sample size. This approach also requires the researcher to specify the level of significance, α, and the power, 1 − β.
Remember that the strength of association indicates the proportion of the population variance in the dependent variable that is accounted for by the independent variable. Omega squared is used to measure the strength of association in an analysis of variance with fixed treatment effects, whereas the intraclass correlation is used in an analysis of variance with random treatment effects.
Based on Cohen (1988), we know that (for strength of association):
ω^2 = 0.010 is a small association
ω^2 = 0.059 is a medium association
ω^2 = 0.138 or larger is a large association
And for effect size:
f = 0.10 is a small effect size
f = 0.25 is a medium effect size
f = 0.40 or larger is a large effect size.
Back to Approach #3: if we have a completely randomized design with p treatment levels, then we can calculate the sample size necessary to detect any magnitude of strength of association OR any magnitude of effect size (the two formulations are mathematically equivalent; see the sketch below). Again, if you are interested in working through the math for this approach, I refer you to Kirk (2013).
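As a sketch of why the two formulations agree: the standard conversion is f = √(ω² / (1 − ω²)), under which Cohen's benchmark values above map onto each other exactly (e.g. ω² = 0.059 gives f ≈ 0.25), and the noncentrality parameter is λ = f² × p × n. The p = 4 below is a hypothetical choice:

    w2 <- 0.059                  # medium strength of association (Cohen)
    f  <- sqrt(w2 / (1 - w2))    # = 0.25, Cohen's medium effect size
    p  <- 4                      # hypothetical number of treatment levels
    for (n in 2:200) {
      lambda <- f^2 * p * n      # noncentrality for N = p * n total observations
      power  <- 1 - pf(qf(0.95, p - 1, p * (n - 1)),     # alpha = 0.05
                       p - 1, p * (n - 1), ncp = lambda)
      if (power >= 0.80) { cat("n per treatment:", n, "\n"); break }
    }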
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Kirk, R. E. (2013). Experimental design: Procedures for the behavioral sciences (4th ed.). Thousand Oaks, CA: Sage.
Tang, P. C. (1938). The power function of the analysis of variance tests with tables and illustrations of their use. Statistical Research Memoirs, 2, 126–149.
Best Answer
Sample size for ANOVA. If you know the number of groups, the common variance of the group populations, the size of the difference in means you want to detect, and the power with which you want to detect that difference, then you can find the number of replications needed in each group. Most statistical software packages have a 'power and sample size' procedure for making such determinations.
In an actual situation you will have to guess some of these numbers, and others are a matter of how much risk one is prepared to take. But results from similar past experiments can serve as a guide. (Not to mention the budget for the study being planned.)
Minitab's power and sample size procedure, run for 90% power, testing at the 5% level, to detect differences as small as 2 units among 4 groups having standard deviation 1, gives the result that $n = 9$ subjects are needed in each of the four groups.
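As a hedged check of that scenario in R (not Minitab itself), using the least favorable configuration of means for a maximum difference of 2 units, i.e. two means at the extremes and the other two at the midpoint:

    mu <- c(-1, 0, 0, 1); k <- 4    # least favorable means, max difference = 2, sigma = 1
    for (n in c(8, 9)) {
      lambda <- n * sum((mu - mean(mu))^2)      # noncentrality parameter
      crit   <- qf(0.95, k - 1, k * (n - 1))    # 5%-level critical F value
      cat("n =", n, "power =",
          round(1 - pf(crit, k - 1, k * (n - 1), ncp = lambda), 3), "\n")
    }

Power should cross 0.90 between n = 8 and n = 9, consistent with the Minitab result.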
Testing for normality. Finding the sample size necessary to judge normality is much more difficult. In the real world, nothing is exactly normally distributed, so you're probably testing whether the data are nearly enough normal that it is OK to do a standard ANOVA. That is more often a matter of experience and judgment than a matter of probability computation.
It takes a certain number of observations before a test of normality will work. Also, for very large sample sizes, such tests can be 'too fussy', rejecting the null hypothesis of normality for data that one supposes are near enough to normal to be OK. For example, rounded normal data are not normal, and in the real world all data must be rounded to some number of decimal places.
In the R session below, we sample $n = 1000$ observations from $\mathsf{Norm}(\mu=50, \sigma=3).$ The Shapiro-Wilk test of normality is one of the better and most commonly used. It rejects the sample, known to have been normal before rounding, as not normal. This is not even a close call; the P-value is nearly $0.$ It is not as if we have 'rounded away' the essence of the data: there are 20 distinct integer values among the rounded data. (Even so, rounding to one or two decimal places might have been a better choice.)
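A minimal reconstruction of that session (the seed is an arbitrary choice, so exact values will vary slightly from the original):

    set.seed(2023)                        # arbitrary seed, not from the original session
    x <- rnorm(1000, mean = 50, sd = 3)   # exactly normal before rounding
    y <- round(x)                         # round to integers
    shapiro.test(y)                       # P-value essentially 0: rounded data 'not normal'
    length(unique(y))                     # about 20 distinct integer values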
A normal probability plot of the rounded data looks nicely linear (except for the usual wobbles in the far tails) and their histogram is well-matched by the normal density function of the population from which the data were sampled.
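For example, regenerating y as in the previous snippet:

    y <- round(rnorm(1000, mean = 50, sd = 3))
    qqnorm(y); qqline(y)                             # near-linear, wobbles in the far tails
    hist(y, prob = TRUE)                             # histogram of the rounded data
    curve(dnorm(x, mean = 50, sd = 3), add = TRUE)   # density of the sampled population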
By contrast, tests of normality have notoriously low power to detect non-normality in small samples. In the R session below, samples of size 10 from a uniform population (which lacks tails) and from a gamma distribution (which is strongly right-skewed) both 'pass' the Shapiro-Wilk test (5% level).
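A hedged reconstruction (the seed and the gamma shape parameter are arbitrary choices):

    set.seed(42)                   # arbitrary seed
    u <- runif(10)                 # uniform on (0, 1): no tails
    g <- rgamma(10, shape = 2)     # right-skewed (shape chosen arbitrarily)
    shapiro.test(u)$p.value        # typically > 0.05, so the uniform sample 'passes'
    shapiro.test(g)$p.value        # typically > 0.05, so the gamma sample 'passes' too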
All 'rules of thumb' are wrong some of the time. But maybe one could say that the 'Goldilocks zone' of normality testing runs from a few dozen to a few hundred observations, depending on the reason for testing normality.