Solved – Why is the standard deviation for this data zero, and what does that imply

standard deviation

I found the following data for 1000 rolls of a 20-sided die by a dice program:

[38, 53, 47, 42, 58, 42, 47, 56, 48, 57, 49, 49, 47, 45, 43, 49, 52, 55, 62, 61]

(Where the first value is number of times 1 was rolled, second value is number of times 2 was rolled, etc.)

I, a stats-know-nothing, tried to calculate the standard deviation for this and was surprised to come up with zero. I thought that was only possible if all the values were identical, but apparently that's not the case.

The reason I'm confused is that the calculation doesn't let me make a statement like "X% of die roll values come up within Y of the mean, while W% of die rolls only come up within Z of the mean." And I thought that was the point.

(To fill in more specific values: I was expecting to be able to say something like "with a mean of 50 for how many times a given value is rolled, 68% of roll values appear within +/- 5 times of the mean, while 95% of die rolls come up within +/- 10 of the mean.")

What am I misunderstanding? Why do I only get zero and then have no further insights?

Best Answer

An elaboration of @Dave's Answer (+1): You have data in 'frequency-value' format. (It is more compact than listing the $n=1000$ individual die faces observed.) If the $k = 20$ values are $v_i = i,$ for $i=1$ through $k,$ and the corresponding frequencies are $f_i,$ then the sample size is $n = \sum_{i=1}^k f_i,$ the sample mean is $A = \bar X = \frac 1n\sum_{i=1}^k f_iv_i,$ the sample variance is $S^2 = \frac{1}{n-1}\sum_{i=1}^k f_i(v_i - A)^2,$ and the sample standard deviation is $S = \sqrt{S^2}.$

In R:

f=c(38, 53, 47, 42, 58, 42, 47, 56, 48, 57, 
    49, 49, 47, 45, 43, 49, 52, 55, 62, 61)
n = sum(f);  n
[1] 1000
v = 1:20
a = sum(f*v)/sum(f);  a
[1] 10.843
s.sq = sum(f*(v-a)^2)/(n-1);  s.sq
[1] 33.84219
s = sqrt(s.sq);  s
[1] 5.817404
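
As a cross-check, here is a minimal sketch (not part of the original answer) that expands the frequency table back into the 1000 individual rolls and lets R's built-in `mean` and `sd` do the work. The name `x` is just an illustrative choice; the results should agree with the hand computation above.

```r
# Frequencies of faces 1..20, as given in the question
f <- c(38, 53, 47, 42, 58, 42, 47, 56, 48, 57,
       49, 49, 47, 45, 43, 49, 52, 55, 62, 61)
v <- 1:20

# Rebuild the raw sample: 38 ones, 53 twos, ..., 61 twenties
x <- rep(v, times = f)

length(x)   # 1000
mean(x)     # 10.843
sd(x)       # about 5.8174 (sd() uses the n - 1 denominator)
```

Note that `sd(f)` would answer a different question: it measures how much the 20 counts vary around their average of 50, not how much the rolled faces vary.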

Based on these data you could make a 95% confidence interval for the true population mean $\mu$ of the form $\bar X \pm 1.96\,S/\sqrt{n}.$ In particular, $10.843 \pm 1.96(5.8174)/\sqrt{1000}$ or $(10.48, 11.20),$ which does include the true value $\mu = 10.5$ (see the theoretical computation below). [The idea of the "95%" is that, over the long run, for repeated samples of size $n = 1000,$ about 95 in 100 such confidence intervals will include $\mu,$ as happened here.]

pm = c(-1,1)
a + pm*1.96*s/sqrt(n)
[1] 10.48243 11.20357
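
The same interval can be wrapped in a small function that works directly on value/frequency data. This is only a sketch: `ci_from_freq` is a hypothetical helper name, and it uses `qnorm` for the normal quantile (about 1.96 at the 95% level) rather than the hard-coded constant.

```r
# Hypothetical helper (not from the original answer): normal-approximation
# confidence interval for the mean, computed from values v and frequencies f.
ci_from_freq <- function(v, f, level = 0.95) {
  n <- sum(f)                                # sample size
  a <- sum(f * v) / n                        # sample mean
  s <- sqrt(sum(f * (v - a)^2) / (n - 1))    # sample standard deviation
  z <- qnorm(1 - (1 - level) / 2)            # about 1.96 for level = 0.95
  a + c(-1, 1) * z * s / sqrt(n)
}

f <- c(38, 53, 47, 42, 58, 42, 47, 56, 48, 57,
       49, 49, 47, 45, 43, 49, 52, 55, 62, 61)
ci <- ci_from_freq(1:20, f)
round(ci, 2)   # 10.48 11.20
```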

Another simulated sample (from R) yields the 95% confidence interval $(9.98,10.69),$ which also includes $\mu = 10.5.$

set.seed(2020)
x = sample(1:20, 1000, repl=T)
a = mean(x);  a
[1] 10.334
s = sd(x);  s
[1] 5.751306
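
The interval quoted for this second sample is not computed in the transcript, but it can be reproduced from the printed mean and SD. The sketch below copies those two printed values instead of re-running the simulation, since `sample` output for a given seed can differ between R versions.

```r
# Values copied from the printed output above, not re-simulated
a <- 10.334
s <- 5.751306
n <- 1000

ci <- a + c(-1, 1) * 1.96 * s / sqrt(n)
round(ci, 2)   # 9.98 10.69
```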

For a single roll of a fair 20-sided die, $\mu = E(X) = 10.5,$ $\sigma^2 = Var(X) = 33.25,$ and $\sigma = SD(X) = 5.7663.$ Thus, the sample values for $n=1000$ rolls of this die are a reasonable match to the theoretical values.

p = rep(1/20, 20)
v = 1:20
mu = sum(p*v);  mu
[1] 10.5
sgm.sq = sum(p*(v-mu)^2);  sgm.sq
[1] 33.25
sgm = sqrt(sgm.sq);  sgm
[1] 5.766281
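
These values also follow from the standard closed-form results for a discrete uniform distribution on $1, 2, \dots, k$ (stated here without derivation, as a check on the numerical sums), with $k = 20$:

$$\mu = \frac{k+1}{2} = \frac{21}{2} = 10.5, \qquad \sigma^2 = \frac{k^2-1}{12} = \frac{399}{12} = 33.25, \qquad \sigma = \sqrt{33.25} \approx 5.7663.$$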

For a million rolls the match is even closer (about two decimal places):

set.seed(823)
x = sample(1:20, 10^6, repl=T)
a = mean(x);  a
[1] 10.49616
s = sd(x);  s
[1] 5.764575

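Addendum, per comments, on the distribution of the mean of 1000 rolls of your 20-sided die: the simulation below shows results from a million 1000-roll experiments. The sample means are approximately normally distributed around $\mu = 10.5,$ with a standard deviation of about $0.182.$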

set.seed(1234)
a = replicate(10^6, mean(sample(1:20, 1000, rep=T)))
summary(a); sd(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  9.554  10.377  10.500  10.500  10.623  11.337 
[1] 0.1822281  # SD(A)
hist(a, prob=T, br=30, col="skyblue2")
curve(dnorm(x, mean(a), sd(a)), add=T, col="red", lwd=2)

[Figure: histogram of the one million simulated sample means, with the fitted normal density curve in red.]
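
The simulated SD of the mean can be compared with the theoretical value $\sigma/\sqrt{n}.$ The sketch below (variable names `se` and `band` are illustrative, and normality of the sample mean is assumed via the central limit theorem) also gives the kind of "95% within Y of the mean" statement the question was after, but for the average of 1000 rolls, not for individual rolls.

```r
v   <- 1:20
n   <- 1000
mu  <- mean(v)                     # 10.5 for a fair 20-sided die
sgm <- sqrt(mean((v - mu)^2))      # population SD of one roll, about 5.7663

se <- sgm / sqrt(n)                # theoretical SD of the mean of n rolls
se                                 # about 0.1823, vs. 0.1822 in the simulation

# Assuming approximate normality, about 95% of 1000-roll means fall here
band <- mu + c(-1, 1) * 1.96 * se
round(band, 2)                     # 10.14 10.86
```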