[Math] Shouldn’t sample standard deviation DECREASE with increased sample size

probability, statistics

I'm doing an experiment: I roll 5 dice 10,000 times, summing the 5 dice on each roll (so the population mean of a single sum is 17.5), and divide the 10,000 sums into samples of size N = 10. I then repeat the experiment with 3000 rolls, divided into samples of size N = 3. Both experiments therefore have 1000 samples, just with different sample sizes.

Now, I took the AVERAGE sample standard deviation for each experiment and, to my great confusion, the experiment with the SMALLER sample size had a SMALLER average sample standard deviation than the experiment with the LARGER sample size. I thought this was a fluke and re-generated the random numbers, only to find the same result.
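Roughly, the procedure looks like this (a numpy sketch, not my exact code; whether one divides by N or N - 1 changes the numbers but not the effect):

```python
import numpy as np

rng = np.random.default_rng()

def average_sample_sd(num_samples, sample_size, ddof=0):
    # Each observation is the sum of 5 fair six-sided dice;
    # shape after summing: (num_samples, sample_size).
    sums = rng.integers(1, 7, size=(num_samples, sample_size, 5)).sum(axis=2)
    # Standard deviation of each sample, then averaged over all samples.
    return sums.std(axis=1, ddof=ddof).mean()

print(average_sample_sd(1000, 10))  # N = 10: roughly 3.5 (dividing by N)
print(average_sample_sd(1000, 3))   # N = 3:  roughly 2.8 (dividing by N)
```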

In other words, the sample standard deviation of an N = 3 sample is, on average, smaller than that of an N = 10 sample.
How is this possible?

Intuitively, one would think that with a larger sample size the "spread" between values would decrease, because any individual improbable value would have less effect on the spread, much as it does for the mean.
With a smaller sample size, on the other hand, the values would be less "controlled" and thus, on average, more spread out.

Is my intuition just wrong? And is this plain to see mathematically?

FOLLOW UP: I talked to my professor, and he said that a weird phenomenon happens in statistics: sample standard deviations tend to underestimate, rather than overestimate, the population standard deviation, and as the sample size goes up, the sample standard deviation goes up because it becomes "more accurate."

Is this true? And why does that happen? Is there both an intuitive explanation as well as a mathematical proof?

Best Answer

Absent further clarification from the OP, here's what I think is happening:

Each sample consists of $N$ values, each of which is the sum of five fair dice, so each value comes from a distribution on the integer range $[5, 30]$. As a population, these sums have mean $17.5$ and standard deviation about $3.82$.
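For the record, those population values follow from the variance of a single fair die, $35/12$, and the fact that variances of independent dice add:

$$\mu = 5\cdot\tfrac{7}{2} = 17.5, \qquad \sigma^2 = 5\cdot\tfrac{35}{12} = \tfrac{175}{12} \approx 14.58, \qquad \sigma = \sqrt{175/12} \approx 3.82.$$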

With a sample size of just $N$, however, the average of the sample is not generally $17.5$. It will be some value that is, overall, closer to the sample's own data than the population mean is. Hence the standard deviation of that $N$-value sample, computed about its own mean (i.e., treated as a population), will systematically underestimate the standard deviation of the population.
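To make that precise: for any sample $x_1, \dots, x_N$ with sample mean $\bar{x}$ and any fixed value $c$ (such as the population mean $17.5$),

$$\sum_{i=1}^{N}(x_i-\bar{x})^2 \;=\; \sum_{i=1}^{N}(x_i-c)^2 \;-\; N(\bar{x}-c)^2 \;\le\; \sum_{i=1}^{N}(x_i-c)^2,$$

so the spread measured about the sample's own mean can never exceed the spread measured about the population mean, and on average it is strictly smaller.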

For example, with $N = 3$, if you draw $12, 14, 16$, you have an average of $14$. The standard deviation of the sample, treated as a population, is about $1.63$. But the RMS distance of those values from the actual population mean of $17.5$ is about $3.86$. The disparity arises because the sample average is closer to the data, overall, than the population mean is.
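Written out, with the sample mean $14$ versus the population mean $17.5$:

$$\sqrt{\frac{(12-14)^2+(14-14)^2+(16-14)^2}{3}} = \sqrt{\frac{8}{3}} \approx 1.63, \qquad \sqrt{\frac{(12-17.5)^2+(14-17.5)^2+(16-17.5)^2}{3}} = \sqrt{\frac{44.75}{3}} \approx 3.86.$$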

The larger $N$ is, the smaller the expected disparity, and that's perhaps why you obtain a smaller standard deviation for $N = 3$ than you do for $N = 10$.


ETA: Dividing by $N-1$ instead of $N$ gives an unbiased estimator of the population variance. Its square root, the usual sample standard deviation, is still biased low (by Jensen's inequality), though far less severely than the divide-by-$N$ version.
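A quick numerical check of that distinction (a numpy sketch; the seed and sample counts are arbitrary):

```python
import numpy as np

# True standard deviation of the sum of 5 dice: sqrt(175/12) ~ 3.82.
rng = np.random.default_rng(0)
sums = rng.integers(1, 7, size=(100_000, 3, 5)).sum(axis=2)  # 100,000 samples of size N = 3

print(sums.std(axis=1, ddof=0).mean())           # divide by N:   ~2.8, badly low
print(sums.std(axis=1, ddof=1).mean())           # divide by N-1: ~3.4, still low
print(np.sqrt(sums.var(axis=1, ddof=1).mean()))  # sqrt of mean unbiased variance: ~3.82
```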
