Solved – Is standard deviation useful for frequency of occurrence

binary data, frequency, mean, r, standard deviation

Frequency of occurrence (FO) is a simple metric measuring the proportion of samples (often expressed as a percentage) where a certain item is present. It can be calculated as follows:

$FO= 100\% \times \frac{n}{N}$, where n is the number of samples where a certain item was observed and N the total number of samples.

For binary data, FO is equivalent to the mean of a binary vector multiplied by 100%, i.e.:

x <- c(rep(1, 5), rep(0, 5))
x
# [1] 1 1 1 1 1 0 0 0 0 0
100*mean(x)
# [1] 50

Following this logic, it is possible to calculate standard deviation for the FO estimate:

100*sd(x)
# [1] 52.70463

Yet the standard deviation appears to be affected by the number of observations:

100*mean(c(rep(1,5*10^6), rep(0,5*10^6)))
# [1] 50
100*sd(c(rep(1,5*10^6), rep(0,5*10^6)))
# [1] 50

But it does not seem to converge to the FO estimate in every case:

100*mean(c(rep(1,2*10^6), rep(0,8*10^6)))
# [1] 20
100*sd(c(rep(1,2*10^6), rep(0,8*10^6)))
# [1] 40

My questions are:

1) What does standard deviation mean in practice for frequency of occurrence?

2) Is this metric, or other variance derivatives (standard error, confidence intervals), useful for expressing the uncertainty of an FO estimate?

Best Answer

The answer is edited based on the comment by @Gregor.

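1) The standard deviation for frequency of occurrence (FO) is $\sqrt{p(1-p)}$, where p is FO/100 (i.e. the proportion); multiply by 100 to express it on the percentage scale. This holds for large samples (see the figure): the sample standard deviation divides by $N-1$, so it equals $\sqrt{\frac{N}{N-1}p(1-p)}$ and is slightly larger in small samples (references: 1, 2). Using this equation one can find the standard deviation for a range of FOs: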

FO (%)  sd (proportion scale)
0   0
10  0.3
20  0.4
30  0.4582
40  0.4898
50  0.5
60  0.4898
70  0.4582
80  0.4
90  0.3
100 0
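
As a minimal sketch, the table above can be reproduced directly from $\sqrt{p(1-p)}$. The names `fo`, `p` and `fo_table` are illustrative and not part of the original answer:

```r
# FO values in percent, converted to proportions
fo <- seq(0, 100, by = 10)
p <- fo / 100

# Large-sample standard deviation of a binary variable with proportion p
sd_p <- sqrt(p * (1 - p))

fo_table <- data.frame(FO = fo, sd = round(sd_p, 4))
print(fo_table)
```

The values are symmetric around FO = 50, where the standard deviation reaches its maximum of 0.5.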

In practice, the convergence occurs at sample sizes > 20: [figure]
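
The convergence can also be checked numerically. The sketch below assumes an FO of 50% and a hand-picked set of sample sizes (both chosen here for illustration) and compares `sd()` with the large-sample value $\sqrt{p(1-p)}$:

```r
p <- 0.5                           # assumed proportion (FO = 50%)
n <- c(2, 4, 10, 20, 100, 1000)    # illustrative sample sizes

# Sample sd of a binary vector containing exactly p * k ones
sample_sd <- sapply(n, function(k) {
  ones <- round(p * k)
  sd(c(rep(1, ones), rep(0, k - ones)))
})

limit_sd <- sqrt(p * (1 - p))      # large-sample value

convergence <- data.frame(n = n,
                          sample_sd = sample_sd,
                          ratio = sample_sd / limit_sd)
print(convergence)
```

Because `sd()` divides by `n - 1`, the ratio equals `sqrt(n / (n - 1))`, which approaches 1 as the sample grows. This is the only way sample size enters the standard deviation.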

2) Consequently, the SD depends far more on the FO itself than on sample size, so it does not seem to be a useful measure of the uncertainty of an FO estimate. Yet, Gregor points out that confidence intervals for proportions, which use the variance, are useful. See this link for more information.
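
To illustrate the point about confidence intervals, here is a hedged sketch using base R's `binom.test()`, which returns an exact (Clopper-Pearson) interval for a proportion. The helper `fo_ci()` and its arguments are hypothetical names introduced for this example, and the counts are made up:

```r
# Hypothetical helper: FO, its standard error and a confidence interval,
# all expressed in percent
fo_ci <- function(n_present, n_total, conf_level = 0.95) {
  p_hat <- n_present / n_total
  # Standard error of a proportion: shrinks with sqrt(n_total)
  se <- sqrt(p_hat * (1 - p_hat) / n_total)
  # Exact binomial confidence interval from base R (stats package)
  ci <- binom.test(n_present, n_total, conf.level = conf_level)$conf.int
  100 * c(FO = p_hat, SE = se, lower = ci[1], upper = ci[2])
}

small <- fo_ci(5, 10)       # FO = 50% from 10 samples
large <- fo_ci(500, 1000)   # FO = 50% from 1000 samples
print(rbind(small, large))
```

Unlike the standard deviation, the standard error and the interval width both shrink as the number of samples increases, so they do express how uncertain the FO estimate is.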