Solved – test normality for both groups when comparing from a single population


This should be an easy one. I'm a novice when it comes to statistics and English isn't my first language so bear with me. I have one population that numbers about 700. Of these 700, 25 are of special interest. I want to compare the age, height, weight and BMI of the 25 and 675 groups. The problem is I'm not certain which test I should use.

  1. Do I need to check the normality of both the groups I'm comparing (25 vs 675) or do I just run one check for the whole group (700)? According to my interpretation of Kolmogorov-Smirnov and Shapiro-Wilk (and visual interpretation of QQ plots etc.), the bigger samples (675 and 700) are not normally distributed for any of the variables. The smaller sample of 25, however, is normally distributed for every variable except age.

  2. What kind of test should I use to compare these two groups?

Best Answer

If the data are more or less normal in each group, you can do a two-sample t-test to compare the group means of each variable. If you want to compare all four variables at once, you can do a multivariate t-test (Hotelling's two-sample T-squared test).
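As a rough illustration of the two-sample comparison, here is a minimal sketch in plain Python. It uses Welch's unequal-variance form of the t statistic, which is my choice here because the group sizes are so lopsided, not something the question specifies. The function name `welch_t` and the height values are made up for the example; the data are simulated, not the asker's.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Synthetic heights in cm, for illustration only (not the asker's data).
rng = random.Random(0)
special = [rng.gauss(172.0, 8.0) for _ in range(25)]
rest = [rng.gauss(170.0, 8.0) for _ in range(675)]

t_stat, df = welch_t(special, rest)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

The sketch stops at the statistic and its degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(special, rest, equal_var=False)` should return the same statistic together with a p-value.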

The two-sample t-test really only requires that the sample averages of the two groups be normal. This will happen if the data are normal, but it is also a fair assumption when the data are sort of normalesque, i.e., one mode and more or less symmetric. One of the most important theorems in Statistics is the Central Limit Theorem, which states that in most situations, the distribution of the sample average tends towards normality as the sample gets large. So even if your big sample is not normal, the average of 675 items will be pretty close, and your t-test will work. In fact, if the original data are symmetric and you don't have wild outliers, the average of a sample of 25 is pretty close to normal as well: convergence can be rapid.
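The Central Limit Theorem point can be checked with a small simulation. The sketch below is only an illustration under assumptions of mine: it draws from an exponential distribution (clearly skewed, so a harder case than the symmetric data mentioned above) and uses skewness as a crude measure of non-normality. For independent draws, the skewness of the sample average shrinks like the skewness of the data divided by the square root of n. The helper names are hypothetical.

```python
import random
import statistics


def skewness(xs):
    """Sample skewness: third central moment over the cubed standard deviation."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)


def skew_of_sample_means(n, reps, rng):
    """Skewness of `reps` sample averages, each taken over n exponential draws."""
    means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
             for _ in range(reps)]
    return skewness(means)


rng = random.Random(1)
skews = {n: skew_of_sample_means(n, 2000, rng) for n in (1, 25, 675)}
for n, s in skews.items():
    print(f"n = {n:3d}: skewness of the sample mean is about {s:.2f}")
```

With strongly skewed data like this, the average of 25 values still carries some visible skew, while the average of 675 should be close to symmetric. With roughly symmetric data the small group would fare much better.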

Now, a word about statistical tests. Another big theorem states that when the null hypothesis is false, your test will reject the null with probability approaching one as the sample gets large. So when you have a big sample, like 675, even a small departure from normality will be picked up by Kolmogorov-Smirnov. A similar departure may well go undetected in a sample of size 25. That's why you think your small sample is normal and your large sample is not.
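To see the sample-size effect in numbers, here is a hedged sketch that applies one normality test to the same mildly skewed distribution at n = 25 and n = 675. Note the substitution: it uses the Jarque-Bera test rather than Kolmogorov-Smirnov or Shapiro-Wilk, purely because its asymptotic p-value can be written in a few lines of plain Python. That approximation is rough at n = 25, so treat the small-sample rate as indicative only. The gamma distribution and the function names are my own assumptions.

```python
import math
import random


def jarque_bera_p(xs):
    """Asymptotic p-value of the Jarque-Bera normality test."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    return math.exp(-jb / 2.0)  # chi-square survival function with 2 df


def rejection_rate(n, reps, rng, alpha=0.05):
    """Share of mildly skewed samples of size n flagged as non-normal."""
    hits = 0
    for _ in range(reps):
        sample = [rng.gammavariate(20.0, 1.0) for _ in range(n)]
        hits += jarque_bera_p(sample) < alpha
    return hits / reps


rng = random.Random(2)
rate_small = rejection_rate(25, 300, rng)
rate_large = rejection_rate(675, 300, rng)
print(f"n = 25: rejected {rate_small:.0%}; n = 675: rejected {rate_large:.0%}")
```

Both groups come from exactly the same non-normal distribution here, yet the large samples should be rejected far more often than the small ones. The difference in verdicts reflects the power of the test, not a difference in the underlying shape.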

It's also why a lot of people don't test for normality before carrying out a subsequent test. A better plan is to do your tests, and then look at the residuals and see if they look normal. Plot a histogram, or do a quantile plot. Whatever software you are using will have options for doing this, or you should switch software.
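If your software does not offer a quantile plot, the points are easy to compute yourself. The following is a minimal sketch, assuming that "residuals" for a two-group comparison means each value minus its own group mean. The helper names, the plotting positions (i + 0.5)/n, and the BMI-like numbers are assumptions for the example, and the data are simulated.

```python
import random
import statistics


def qq_points(values):
    """Pair sorted values with the standard normal quantiles they should match."""
    n = len(values)
    std_normal = statistics.NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(values)))


def residuals(group):
    """Deviation of each observation from its own group mean."""
    m = statistics.fmean(group)
    return [x - m for x in group]


# Synthetic BMI-like values, for illustration only.
rng = random.Random(3)
special = [rng.gauss(27.0, 4.0) for _ in range(25)]
rest = [rng.gauss(25.0, 4.0) for _ in range(675)]

points = qq_points(residuals(special) + residuals(rest))
for theo, obs in points[::100]:
    print(f"normal quantile {theo:+.2f} -> residual {obs:+.2f}")
```

Plotting the pairs in `points` with any charting tool gives the quantile plot. Residuals that are roughly normal fall near a straight line, while heavy tails or skew show up as curvature at the ends.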

Rather than check for normality using a test, the better approach is to graph the data, examine outliers (if they exist), and possibly remove them. Then do the comparison.

Some people would advise a graphical approach to comparing your groups: do boxplots for the 25 and the 675. I like the idea of a formal test in this case because the sample sizes differ so much. The average of the 25 could differ a lot from the average of the 675 due simply to random fluctuation. That sort of distinction can be hard to eyeball on a boxplot, so best to do the test.