Generally it is the residuals that need to be normally distributed. This implies that each group is normally distributed, but you can do the diagnostics on the residuals (values minus group mean) as a whole rather than group by group. It is possible (and even common) for the data to be approximately normal within each group while the overall dataset is quite non-normal because the group means differ; you can still use normal theory tests in this case.
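A minimal sketch of that pooled-residual diagnostic (hypothetical data, two groups with different means):

```r
set.seed(42)
values <- c(rnorm(20, mean = 5), rnorm(20, mean = 9))  # normal within each group
group  <- factor(rep(c("A", "B"), each = 20))
resids <- values - ave(values, group)   # each value minus its group mean
qqnorm(resids); qqline(resids)          # one normality check for all groups at once
hist(values)                            # the raw pooled data: visibly bimodal
```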
Note that the real question is not "exactly normal" but "normal enough for the given problem". With small datasets the question of normality matters most, but you have low power to detect non-normality (unless it is very extreme); with large datasets the Central Limit Theorem kicks in, so your data does not need to be that normal, yet you have high power to detect even small departures from normality. So when doing formal tests of normality as a precondition for t-tests or ANOVA, you are either in the situation where you have a meaningless answer to a meaningful question, or a meaningful answer to a meaningless question (there may be some middle size where both are meaningful, but I expect that middle range is really where both are meaningless).
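A hedged sketch of that power asymmetry, using the Shapiro-Wilk test on a small, clearly skewed sample and a large, nearly normal one:

```r
set.seed(1)
small_skewed    <- rexp(15)             # clearly non-normal, but n is tiny
large_near_norm <- rt(5000, df = 20)    # t(20) is very close to normal
shapiro.test(small_skewed)              # low power: may easily fail to reject
shapiro.test(large_near_norm)           # high power: a mild departure is often flagged
```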
So, no: just because a small sample does not reject the null does not mean it is safe to use normal theory methods. Knowledge about the source of the data and some diagnostic plots are likely to be more useful in that decision; or, if you are worried about non-normality, just go straight to the non-parametric tests.
If you really feel the need for a p-value testing exact normality, then you can use the SnowsPenultimateNormalityTest function in the TeachingDemos package for R (but be sure to read the help page).
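Usage is a one-liner (a sketch; the data here is arbitrary):

```r
# install.packages("TeachingDemos")  # if not already installed
library(TeachingDemos)
x <- rnorm(30)
SnowsPenultimateNormalityTest(x)     # and do read ?SnowsPenultimateNormalityTest
```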
Another option for testing "normal enough", if you need more than the diagnostic plots, is to use the methodology in:
Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.-K., Swayne, D.F. and Wickham, H. (2009) "Statistical inference for exploratory data analysis and model diagnostics", Phil. Trans. R. Soc. A, 367, 4361-4383. doi:10.1098/rsta.2009.0120
(the vis.test function in the TeachingDemos package for R is one implementation of this).
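For flavour, here is a hand-rolled sketch of the lineup idea from that paper (not the vis.test implementation itself): hide the real data's QQ plot among plots of data simulated under normality, and see whether you can pick it out.

```r
# Lineup sketch: if the real panel is indistinguishable from the simulated
# normal panels, the data is arguably "normal enough".
lineup_qq <- function(x, nrowcol = 3) {
  nplots <- nrowcol^2
  pos <- sample(nplots, 1)                       # random slot for the real data
  op <- par(mfrow = c(nrowcol, nrowcol)); on.exit(par(op))
  for (i in seq_len(nplots)) {
    y <- if (i == pos) x else rnorm(length(x), mean(x), sd(x))
    qqnorm(y, main = paste("Plot", i)); qqline(y)
  }
  invisible(pos)                                 # reveals which panel was real
}
set.seed(7)
real_slot <- lineup_qq(rexp(40))  # skewed data hidden among normal fakes
real_slot                         # check your guess afterwards
```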
The important thing to take away is that knowledge about the process that produced your data is much more important than the output from some program/algorithm written by someone who knows/knew much less about your data and question than you do.
Per @seanv507's link to Wikipedia:
"This formulation is based on the class of twice differentiable functions, and the roughness penalty based on the second derivative is the most common in modern statistics literature, although the method can easily be adapted to penalties based on other derivatives."
This answers my question.
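For concreteness, the criterion in question is the usual smoothing-spline objective (standard form, stated here for reference):

$$\min_{f}\ \sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \int \big(f''(t)\big)^2\,dt,$$

where $\lambda \ge 0$ trades goodness of fit against roughness; replacing $f''$ with another derivative gives the adapted penalties the quote mentions.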
Best Answer
Rarely, if ever, do a parametric test and a non-parametric test actually have the same null. The parametric $t$-test is testing the mean of the distribution, assuming the first two moments exist. The Wilcoxon rank sum test does not assume any moments, and tests equality of distributions instead. Its implied parameter is a weird functional of distributions: the probability that an observation from one sample is lower than an observation from the other. You can sort of talk about comparisons between the two tests under the completely specified null of identical distributions... but you have to recognize that the two tests are testing different hypotheses.
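To make the contrast concrete (standard notation, not part of the original answer), the two nulls can be written as

$$H_0^{t}:\ \mu_X = \mu_Y \qquad\text{vs.}\qquad H_0^{W}:\ \Pr(X < Y) + \tfrac12\Pr(X = Y) = \tfrac12,$$

where the left-hand side of $H_0^{W}$ is the "weird functional" mentioned above.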
The information that parametric tests bring in along with their assumptions helps improve the power of the tests. Of course, that information had better be right, but there are few if any domains of human knowledge these days where such preliminary information does not exist. An interesting exception that explicitly says "I don't want to assume anything" is the courtroom, where non-parametric methods continue to be widely popular -- and it makes perfect sense for the application. There's probably a good reason, pun intended, that Phillip Good authored good books on both non-parametric statistics and courtroom statistics.
There are also testing situations where you don't have access to the microdata necessary for the non-parametric test. Suppose you were asked to compare two groups of people to gauge whether one is more obese than the other. In an ideal world, you would have height and weight measurements for everybody, and you could form a permutation test stratifying by height. In a less than ideal (i.e., real) world, you may only have the mean height and mean weight in each group (or maybe some ranges or variances of these characteristics on top of the sample means). If you only have the means, your best bet is to compute the mean BMI for each group and compare them. If you have means and variances, you could assume a bivariate normal for height and weight (you'd probably have to take a correlation from some external data if it did not come with your samples), form some sort of regression line of weight on height within each group, and check whether one line is above the other.
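A hedged sketch of the summary-statistics fallback (all numbers hypothetical):

```r
# With only group-level means available, build a BMI-style index from the
# summaries rather than from individual records. Values below are made up.
group_summaries <- data.frame(
  group       = c("A", "B"),
  mean_height = c(1.72, 1.68),   # metres (hypothetical)
  mean_weight = c(78.0, 74.5)    # kilograms (hypothetical)
)
# BMI = weight / height^2, here applied to the group means as a rough proxy
group_summaries$bmi_proxy <-
  group_summaries$mean_weight / group_summaries$mean_height^2
group_summaries
```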