[Math] The relationship between sample variance and proportion variance

statistics, variance

I'm trying to see the relationship between the sample variance equation

$\sum(X_i- \bar X)^2/(n-1)$ and the variance estimate, $\bar X(1-\bar X),$ in case of binary samples.

Do the two formulas give the same result? If not, what is the relationship between them?

I've been trying to prove the relationship, but I'm finding it quite challenging.

Please help!

Best Answer

I suppose your question is whether the two formulas give the same answer for binary data. Here is an example to illustrate that they are almost the same, but not exactly.

Suppose I have a sample of a thousand zeros and ones in which there are 283 ones. Then $\bar X = 283/1000 = 0.283.$ Thus, $\bar X(1-\bar X) = 0.283(1 - 0.283) = 0.202911.$

An alternate general formula for the sample variance of values $X_i$ is

$$S^2 = \frac{\sum_{i=1}^n X_i^2 - n \bar X^2}{n-1}.$$

In a binary sample $\sum_{i=1}^n X_i^2 = \sum_{i=1}^n X_i$, because $0^2 = 0$ and $1^2 = 1.$
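
As a sketch of the general relationship (using only the identity above together with $\sum_{i=1}^n X_i = n\bar X$), substituting into the formula for $S^2$ gives, for any binary sample of size $n$,

$$S^2 = \frac{n\bar X - n\bar X^2}{n-1} = \frac{n}{n-1}\,\bar X(1-\bar X).$$

So the two quantities differ only by the factor $n/(n-1),$ which approaches $1$ as $n$ grows.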

Thus, the general formula gives $S^2 = \frac{283 - 1000(0.283)^2}{999} = 0.2031141.$ If (as in the Comment by @A.S) the denominator were $n = 1000$ instead of $n-1=999,$ this would simplify to $$\frac{283 - 1000(0.283)^2}{1000} = 0.283 - 0.283^2 = 0.283(1 - 0.283) = \bar X(1- \bar X).$$

The formula for the population variance is often written with the population size $n$ in the denominator.
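
For anyone who wants to check the arithmetic numerically, here is a minimal Python sketch of the example above. It relies on the standard library's `statistics.variance` (divisor $n-1$) and `statistics.pvariance` (divisor $n$); the variable names are only illustrative.

```python
import math
import statistics

# Binary sample from the example: 283 ones and 717 zeros (n = 1000).
sample = [1.0] * 283 + [0.0] * 717
n = len(sample)
xbar = sum(sample) / n

plug_in = xbar * (1 - xbar)              # Xbar * (1 - Xbar)
s2 = statistics.variance(sample)         # sample variance, divides by n - 1
pop_var = statistics.pvariance(sample)   # divides by n instead

# The n-divisor version matches Xbar * (1 - Xbar); the (n - 1)-divisor
# version is larger by the factor n / (n - 1).
assert math.isclose(pop_var, plug_in)
assert math.isclose(s2, n / (n - 1) * plug_in)

print(round(plug_in, 7), round(s2, 7), round(pop_var, 7))
```

With a sample this large the two estimates agree to about three decimal places; for small $n$ the factor $n/(n-1)$ makes a more noticeable difference.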