[Math] How to calculate standard deviation of entire set from means and populations of all subsets

standard deviation, statistics

If I have a list $X$ of size $N$, where $N$ is known but the exact values in $X$ aren't, how can I calculate the standard deviation of $X$ using only the means and sizes of $n$ separate sublists? (The sublists do not overlap, and every element of $X$ is contained in exactly one sublist.)

Example)

$X=\{4, 9, 5, 4, 6, 10, 3, 9, 3, 2, 6, 8\}$

$A=\{4, 9, 5, 4\}$

$B=\{6, 10, 3, 9\}$

$C=\{3, 2, 6, 8\}$

Without knowing the values in any of the lists, can I determine the standard deviation of $X$ given that every object in $X$ is in one of the three sublists, and that this information is known:

$\bar x_A=5.5$, $\bar x_B=7$, $\bar x_C=4.75$, $\bar x_X=5.75$

$N_A=4$, $N_B=4$, $N_C=4$, $N_X=12$

Although my example uses sublists of equal size, my actual data does not, so I am looking for answers which aren't specific to sublists of equal size.

Best Answer

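Suppose I have two distinct samples A and B and that I know the sizes, sample means, and sample SDs of both. Then I can find the mean and SD of the combined sample (A and B together). The sample SDs are essential here: sizes and means alone do not determine the SD of the combined sample, because they say nothing about the spread within each subsample.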

For means: Suppose we have $n_A$, $\bar X_A$, $n_B$, and $\bar X_B.$ Then I can find the mean $\bar X_C$ of the combined sample:

$$\bar X_C = \frac{\sum_{i=1}^{n_C}X_i}{n_C} = \frac{\sum_{i=1}^{n_A}X_i+\sum_{i=1}^{n_B}X_i }{n_A + n_B} = \frac{n_A\bar X_A + n_B\bar X_B}{n_A+n_B}.$$
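As a quick check of this formula, here is a minimal Python sketch. The function name `combined_mean` is an illustrative choice, not something from the answer, and the inputs are the sizes and means of sublists $A$ and $B$ from the question.

```python
def combined_mean(n_a, mean_a, n_b, mean_b):
    """Mean of the pooled sample, from the sizes and means of its two parts."""
    return (n_a * mean_a + n_b * mean_b) / (n_a + n_b)


# Sublists A and B from the question: sizes 4 and 4, means 5.5 and 7.
print(combined_mean(4, 5.5, 4, 7))  # 6.25
```

Each mean is weighted by its sample size, so the same function works unchanged for sublists of unequal size.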

For variances: The following identity for the variance $S^2$ of a sample $Y_1, \dots, Y_m$ of size $m$ makes it possible, even if a bit tedious, to get the sample variance of a combined sample:

$$S^2 =\frac{\sum_{i=1}^m (Y_i - \bar Y)^2}{m-1} = \frac{\sum_{i=1}^m Y_i^2 - m\bar Y^2}{m-1}.$$
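The second form of the identity follows from expanding the square and using $\sum_{i=1}^m Y_i = m\bar Y$:

$$\sum_{i=1}^m (Y_i - \bar Y)^2 = \sum_{i=1}^m Y_i^2 - 2\bar Y\sum_{i=1}^m Y_i + m\bar Y^2 = \sum_{i=1}^m Y_i^2 - 2m\bar Y^2 + m\bar Y^2 = \sum_{i=1}^m Y_i^2 - m\bar Y^2.$$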

First, use the method for finding the mean of a combined sample to find $\bar X_C$ and thus $\bar X_C^2.$

Second, use the identity to find $\Sigma_A =\sum_{i=1}^{n_A} X_i^2$ from $n_A,$ $S_A^2$ and $\bar X_A$; similarly find $\Sigma_B = \sum_{i=1}^{n_B} X_i^2$ from $n_B,$ $S_B^2$ and $\bar X_B$. Then $\Sigma_C =\sum_{i=1}^{n_C} X_i^2 = \Sigma_A + \Sigma_B.$
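Written out, solving the identity for the sum of squares gives

$$\Sigma_A = (n_A - 1)S_A^2 + n_A\bar X_A^2, \qquad \Sigma_B = (n_B - 1)S_B^2 + n_B\bar X_B^2.$$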

Finally, use the identity once again to find $S_C^2$ from $n_C = n_A + n_B,$ $\bar X_C^2$ and $\Sigma_C.$
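The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `combine` and `summarize` are hypothetical, variances use the $n-1$ divisor as in the identity, and the raw lists from the question are used only to produce the summary statistics and to check the result.

```python
import math
import statistics


def combine(n_a, mean_a, var_a, n_b, mean_b, var_b):
    """Pool two samples given each one's size, mean, and sample variance."""
    n_c = n_a + n_b
    # Step 1: mean of the combined sample.
    mean_c = (n_a * mean_a + n_b * mean_b) / n_c
    # Step 2: recover each sum of squares, sum X_i^2 = (n - 1) S^2 + n * mean^2.
    sumsq_a = (n_a - 1) * var_a + n_a * mean_a ** 2
    sumsq_b = (n_b - 1) * var_b + n_b * mean_b ** 2
    sumsq_c = sumsq_a + sumsq_b
    # Step 3: apply the identity again to get the combined sample variance.
    var_c = (sumsq_c - n_c * mean_c ** 2) / (n_c - 1)
    return n_c, mean_c, var_c


def summarize(data):
    """Size, mean, and sample variance of a list (illustrative helper)."""
    return len(data), statistics.mean(data), statistics.variance(data)


A = [4, 9, 5, 4]
B = [6, 10, 3, 9]
C = [3, 2, 6, 8]

# Combine A with B, then fold in C: this handles more than two subsamples.
ab = combine(*summarize(A), *summarize(B))
n, mean, var = combine(*ab, *summarize(C))
print(n, mean, math.sqrt(var))  # size 12, mean 5.75, SD approximately 2.70
```

The combined SD agrees with the SD computed directly from all twelve values of $X$, and nothing in `combine` requires the subsamples to have equal sizes.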

Notes: (1) This method can be used to combine more than two subsamples. (2) Also, essentially this same method is frequently used in computer programs to update the sample mean and variance after each new observation is entered. In that case, sample A has the previous $n$ observations and 'sample' B has the one new observation. [In practice, it is easier to update $n,$ $T = \sum_{i=1}^n X_i,$ and $\Sigma = \sum_{i=1}^n X_i^2$ after each new observation is input.]
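The running update in note (2) might look like the following sketch. The class `RunningStats` and its method names are hypothetical; it stores only $n$, $T$, and $\Sigma$ and derives the mean and sample variance from them on demand.

```python
import math


class RunningStats:
    """Tracks n, T = sum of X_i, and Sigma = sum of X_i^2 (illustrative helper)."""

    def __init__(self):
        self.n = 0
        self.total = 0.0   # T
        self.sumsq = 0.0   # Sigma

    def add(self, x):
        """Update the three running quantities with one new observation."""
        self.n += 1
        self.total += x
        self.sumsq += x * x

    def mean(self):
        return self.total / self.n

    def variance(self):
        """Sample variance via the identity; needs at least two observations."""
        m = self.mean()
        return (self.sumsq - self.n * m * m) / (self.n - 1)


stats = RunningStats()
for x in [4, 9, 5, 4, 6, 10, 3, 9, 3, 2, 6, 8]:
    stats.add(x)
print(stats.n, stats.mean(), math.sqrt(stats.variance()))
```

One caveat on this design: when the values are large relative to their spread, subtracting $n\bar X^2$ from $\Sigma$ in floating point can lose precision, so some programs use a different update scheme such as Welford's algorithm.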