[Math] Calculating mean and standard deviation of a sampling mean distribution

sampling, statistics

So in class we were asked to find the mean and SD for the dataset given below.

The data set represents a sampling mean distribution for cigarettes smoked per day, along with the number of people in each group. I could easily calculate the mean using the formula $\frac{\sum f_ix_i}{\sum f_i}$. However, the question asks for standard deviations. Does that mean I need to calculate a different SD for every row using $$\text{SE}=\frac{\sigma}{\sqrt{n}}?$$ But that would be wrong, as all the data sets are picked from the same population, so they can't give different SDs. Also, to find skewness I need the SD of the population. Will it be just the SD of the means, or will it be something different?

Best Answer

Judging from the way the question is phrased in the screen capture, I presume that the calculation for the overall sample variance should be simply $$s^2 = \frac{1}{N - 1} \sum_{i=1}^m (n_i - 1) s_i^2, \tag{1}$$ where $N$ is the overall sample size, $n_i$ is the sample size of group $i$, and $s_i^2$ is the sample variance of group $i$, in this case $$s_i^2 = n_i SE_i^2,$$ where $SE_i$ is the standard error of group $i$.
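As a rough illustration of formula $(1)$, here is a minimal Python sketch. The function name `pooled_variance` and its argument names are my own, not anything from the question, and the inputs are the sizes and standard errors of the first two groups as they appear in the table further down in this answer.

```python
def pooled_variance(sizes, std_errors):
    """Formula (1): within-group contributions only (hypothetical helper)."""
    total_n = sum(sizes)
    # Recover each group's sample variance from its standard error:
    # s_i^2 = n_i * SE_i^2
    group_vars = [n * se ** 2 for n, se in zip(sizes, std_errors)]
    # Sum of (n_i - 1) * s_i^2 over all groups
    within_ss = sum((n - 1) * v for n, v in zip(sizes, group_vars))
    return within_ss / (total_n - 1)


sizes = [25, 57]            # n_1, n_2
std_errors = [0.08, 0.10]   # SE_1, SE_2
s2 = pooled_variance(sizes, std_errors)
print(round(s2, 6))  # 0.441481
```

Note that this uses only the group sizes and standard errors; the group means do not enter at all.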

However, the basis for this calculation assumes that the within-group sample means are equal. If they are not, then this calculation is only an approximation, because the overall sample variance is based on the squared deviations from the overall mean, not the within-group means. I discussed this issue in two other posts here:

Can I work out the variance in batches?

How do I combine standard deviations of two groups?

The exact calculation for six groups is going to be somewhat tedious and is not recommended without a computer. It is a common misconception (hence the existence of questions like these) that the overall sample standard deviation has no contribution from the variation that exists between groups.
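To make the between-group contribution explicit, here is the standard decomposition of the total sum of squares. The symbols $x_{ij}$ (observation $j$ in group $i$), $\bar x_i$ (the sample mean of group $i$), and $\bar x$ (the overall sample mean) are notation I am introducing for this purpose:

$$\sum_{i=1}^m \sum_{j=1}^{n_i} (x_{ij} - \bar x)^2 = \sum_{i=1}^m (n_i - 1) s_i^2 + \sum_{i=1}^m n_i (\bar x_i - \bar x)^2.$$

Dividing both sides by $N - 1$ gives the overall sample variance. The first term on the right is exactly formula $(1)$; the second term is the between-group contribution, and it vanishes only when all the group means are equal. For two groups, the second sum simplifies to $\frac{n_1 n_2 (\bar x_1 - \bar x_2)^2}{n_1 + n_2}$.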


Allow me to illustrate the computation with only the first two groups. The table may be written as $$\begin{array}{c|ccc|cc} i & n_i & \bar x_i & SE_i & s_i^2 & (n_i - 1)s_i^2 \\ \hline 1 & 25 & 0.31 & 0.08 & 0.16 & 3.84 \\ 2 & 57 & 0.42 & 0.10 & 0.57 & 31.92 \\ \end{array}$$

Then we can agree that the overall sample mean for the first two groups is $$\bar x = \frac{n_1 \bar x_1 + n_2 \bar x_2}{n_1 + n_2} = 0.386463.$$ The supposed overall sample variance would be, according to formula $(1)$ above, $$s^2 = \frac{1}{25+57 - 1} \left( (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 \right) = \frac{3.84 + 31.92}{81} = 0.441481.$$

But as I stated above, this is not exact, because the two group means differ. The correct formula contains the additional term $$\frac{n_1 n_2 (\bar x_1 - \bar x_2)^2}{(n_1 + n_2)(n_1 + n_2 - 1)} = \frac{25(57)(0.31 - 0.42)^2}{(25+57)(25+57-1)} = 0.00259598,$$ making the true overall sample variance equal to $$s^2 = 0.441481 + 0.00259598 = 0.444077$$ for the first two groups.

If so inclined, you could repeat this calculation on the next two groups, and then the last two groups, giving you three aggregated sample means and sample variances. Then you could merge these aggregates two at a time with two more calculations. But this is not what I think the author of the question had in mind when it was written, because the exact formula I am using, with the adjustment for between-group variance, is not commonly known even to experienced statisticians.
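Since doing this by hand is tedious, here is a minimal Python sketch of the two-group merge, including the between-group term. The function names `merge` and `combine` are hypothetical, chosen for illustration; only the first two groups are used because those are the only values written out in this answer.

```python
def merge(n1, mean1, var1, n2, mean2, var2):
    """Combine two groups' (size, mean, sample variance) into one."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Within-group part: this is formula (1) for two groups
    within = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n - 1)
    # Between-group adjustment for unequal group means
    between = n1 * n2 * (mean1 - mean2) ** 2 / (n * (n - 1))
    return n, mean, within + between


def combine(groups):
    """Fold merge() over rows of (n_i, mean_i, SE_i)."""
    rows = iter(groups)
    n, mean, se = next(rows)
    var = n * se ** 2  # s_i^2 = n_i * SE_i^2
    for n_i, mean_i, se_i in rows:
        n, mean, var = merge(n, mean, var, n_i, mean_i, n_i * se_i ** 2)
    return n, mean, var


n, mean, var = combine([(25, 0.31, 0.08), (57, 0.42, 0.10)])
print(n, round(mean, 6), round(var, 6))  # 82 0.386463 0.444077
```

Because each merge is exact, the order in which groups are combined does not matter (apart from floating-point rounding), so passing all six rows to `combine` should give the same result as merging in pairs.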
