[Math] Combined Data Set Standard Deviation

statistics

Can the sample standard deviation of a combined data set be lower than the sample standard deviations of each separate data set?
My first thought was no: at the lowest, it can equal that of each separate data set. However, when I tried to show this, I assumed
that the means, sample standard deviations, and sample sizes of both data sets are equal.

Fooling around with some examples such as:
$$\text{data set 1 = data set 2: 1,2,3,4,5} \quad N=5, \mu=3, s\approx 1.58 \Rightarrow$$
$$ \text{combined data set: 1,1,2,2,3,3,4,4,5,5} \quad N=10, \mu=3, s\approx 1.49$$
So I'm thinking that for all sample data sets in which the sample sizes, sample standard deviations, and means are equal, the sample standard deviation of the combined
data set will be lower, and that it will approach the common value as $N$ goes to infinity.

How can I show this mathematically?

Best Answer

Usually the sample variance is taken to be the unbiased estimator of $\sigma_X^2$: $$ s^2\equiv \frac{1}{N-1} \sum_{i = 1}^{N} (x_i - \bar{X})^2$$

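So when you combine two identical data sets, the sample size doubles $N \to 2N$, the sample mean $\bar{X}$ is unchanged, and the sum of squares $\sum_{i = 1}^{N} (x_i - \bar{X})^2$ also doubles, but the divisor goes from $N-1$ to $2N-1$, which is slightly more than doubled since $2N-1 > 2(N-1)$. The numerator exactly doubles while the denominator more than doubles, so the sample variance goes down.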
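
Written out as a short derivation (the symbol $s_{\text{comb}}^2$ is just a label introduced here for the sample variance of the combined data set, not notation from the question):

$$ s_{\text{comb}}^2 = \frac{1}{2N-1}\cdot 2\sum_{i = 1}^{N} (x_i - \bar{X})^2 = \frac{2(N-1)}{2N-1}\, s^2 < s^2 $$

For the example in the question, $N=5$ and $s^2 = 2.5$, so $s_{\text{comb}}^2 = \tfrac{8}{9}\cdot 2.5 = \tfrac{20}{9}$ and $s_{\text{comb}} \approx 1.49$, matching the value computed there.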

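When $N \to \infty$, the divisor approaches "being doubled", $(2N-1)/(N-1) \to 2$, so the combined sample variance approaches the sample variance of each data set, and (for a sample drawn from a population with variance $\sigma_X^2$) $s^2 \to \sigma_X^2$. This is what you call the "equal value".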
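
As a quick numerical check of this argument, here is a minimal Python sketch. It assumes the standard library's `statistics.stdev` (which uses the $N-1$ divisor) and the question's example data; the helper name `shrink_factor` is made up for illustration. It compares the directly computed combined standard deviation with $s\sqrt{2(N-1)/(2N-1)}$ and shows that factor tending to 1 as $N$ grows.

```python
import math
import statistics


def shrink_factor(n):
    """Ratio s_combined / s when a data set of size n is merged with an identical copy.

    Hypothetical helper: it just evaluates sqrt(2(n-1) / (2n-1)).
    """
    return math.sqrt(2 * (n - 1) / (2 * n - 1))


data = [1, 2, 3, 4, 5]          # the example from the question
n = len(data)

s_single = statistics.stdev(data)            # divisor N - 1
s_combined = statistics.stdev(data + data)   # divisor 2N - 1
s_predicted = s_single * shrink_factor(n)

print(f"single:    {s_single:.4f}")
print(f"combined:  {s_combined:.4f}")
print(f"predicted: {s_predicted:.4f}")

# The factor is always below 1 and approaches 1 as N grows.
for size in (5, 50, 500, 5000):
    print(size, shrink_factor(size))
```

The printed factors stay strictly below 1 for every finite $N$, which is why the combined value is always slightly smaller, and they get closer to 1 as the sample size grows.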