Variance – How to Estimate Variance of a Population If Population Mean is Known

sample, variance

I know that we use $\frac{1}{n-1}\sum\limits_i (x_i - \bar{x})^2$ to estimate the variance of a population. I remember a video from Khan Academy where the intuition given was that our estimated mean is probably a bit off from the actual one, so the true distances are actually greater than the $x_i - \bar{x}$ we compute; we therefore divide by less ($n-1$ instead of $n$) to get a greater value, resulting in a better estimate.
And I remember reading somewhere that I don't need this correction if I have the actual population mean $\mu$ instead of $\bar{x}$. So I would estimate the variance with $\frac{1}{n}\sum\limits_i (x_i - \mu)^2$.
But I can't find it anymore. Is it true? Can someone give me a pointer?

Best Answer

Yes, it is true. In the language of statistics, we would say that if you have no knowledge of the population mean, then the quantity

$$\frac{1}{n-1} \sum_{i=1}^n \left(x_i-\bar{x} \right)^2$$

is unbiased, which simply means that it estimates the population variance correctly on average. But if you do know the population mean, there is no need to estimate it (that is the role $\bar{x}$ plays), and so no need for the finite-sample correction that comes with that estimate.
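
To see where the $n-1$ comes from, here is a short derivation. It assumes the $x_i$ are i.i.d. with mean $\mu$ and finite variance; the symbol $\sigma^2$ for the population variance is introduced here and is not part of the original answer. Expanding around $\mu$ gives

$$\sum_{i=1}^n \left(x_i-\bar{x} \right)^2 = \sum_{i=1}^n \left(x_i-\mu \right)^2 - n\left(\bar{x}-\mu \right)^2$$

and taking expectations, with $E\left[(x_i-\mu)^2\right]=\sigma^2$ and $E\left[(\bar{x}-\mu)^2\right]=\sigma^2/n$,

$$E\left[\sum_{i=1}^n \left(x_i-\bar{x} \right)^2\right] = n\sigma^2 - n\cdot\frac{\sigma^2}{n} = (n-1)\,\sigma^2$$

So dividing by $n-1$ gives an expectation of exactly $\sigma^2$, while dividing by $n$ would underestimate it by the factor $(n-1)/n$. The term that is lost, $n(\bar{x}-\mu)^2$, is precisely the cost of replacing $\mu$ with $\bar{x}$.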

In fact, it can be shown that the quantity

$$\frac{1}{n} \sum_{i=1}^n \left(x_i-\mu \right)^2$$

is not only unbiased but also has lower variance than the quantity above. This is quite intuitive as part of the uncertainty has now been removed. So we use this one in this situation.
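
The variance claim can be checked with a small simulation. This is only a sketch: the normal distribution, the sample size of 5, and the names `var_known` and `var_bessel` are choices made for illustration, not part of the answer.

```r
# Sketch: compare the spread of the two unbiased estimators by simulation.
# Assumptions (not from the answer): normal data, n = 5, mu = 0, sigma = 1.
set.seed(1)
B     <- 20000   # number of simulated samples
n     <- 5       # observations per sample
mu    <- 0       # known population mean
sigma <- 1       # population standard deviation

# One simulated sample per row
samples <- matrix(rnorm(B * n, mean = mu, sd = sigma), nrow = B)

# Known mean, divide by n
var_known  <- rowMeans((samples - mu)^2)
# Sample mean, divide by n - 1 (Bessel's correction); var() does this
var_bessel <- apply(samples, 1, var)

# Both averages should be close to sigma^2 = 1 (unbiased)
c(mean_known = mean(var_known), mean_bessel = mean(var_bessel))
# The known-mean estimator should show the smaller variance
c(var_known = var(var_known), var_bessel = var(var_bessel))
```

For normal data the theoretical variances are $2\sigma^4/n$ and $2\sigma^4/(n-1)$ respectively, so with these settings the second line of output should land near 0.4 and 0.5.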

It is worth noting that the estimators will differ very little in large sample sizes and hence they are asymptotically equivalent.
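
A rough way to see this is to measure the average gap between the two estimates as the sample size grows. The sketch below assumes standard normal data and an arbitrary set of sample sizes; `mean_gap` is a name made up for the example.

```r
# Sketch: average absolute gap between the two estimators as n grows.
# Assumptions: standard normal data (mu = 0), 2000 replications per size.
set.seed(1)
mu    <- 0
sizes <- c(5, 50, 500, 5000)

mean_gap <- sapply(sizes, function(n) {
  gaps <- replicate(2000, {
    x <- rnorm(n, mean = mu)
    # var(x) divides by n - 1 around the sample mean;
    # mean((x - mu)^2) divides by n around the known mean
    abs(var(x) - mean((x - mu)^2))
  })
  mean(gaps)
})

data.frame(n = sizes, mean_gap = mean_gap)
```

The gap should shrink roughly in proportion to $1/n$, which is what asymptotic equivalence suggests.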

Related Solutions

Solved – Variance of a sample of random variables

This is a problem very often encountered in biology, where a number of independent experiments (100 in your case) are run, each with its own unknown mean, and those means are themselves drawn IID. The only thing one can do is estimate each mean by IID sampling again. Typically, the variable $X_i$ is estimated from a sample of size $n_i$, so the variance of that estimate around $X_i$ will be $\sigma_i^2 = \sigma^2/n_i$, where $\sigma^2$ is the sampling variance, not the variance of $X$. Because each individual observation can be written in the form $X_i + \varepsilon_{ij}$, the variance of the experiment mean $\bar{X}_i$ is $V + \sigma^2/n_i$, where $V$ is the variance of $X$.

You can compute the grand mean as $\bar{X} = \frac{1}{n}\sum_{i=1}^{100}n_i\bar{X}_i$ (where $n = \sum_{i=1}^{100}n_i$), which is a sum of the independent variables $\frac{n_i}{n}\bar{X}_i$. The variance of each is $\frac{n_i^2}{n^2}V+\frac{n_i}{n^2}\sigma^2$, so by summing you get $V\sum_{i=1}^{100}\frac{n_i^2}{n^2}+\frac{\sigma^2}{n}$.
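
The formula for the variance of the grand mean can be sanity-checked by simulation. The sketch below makes several assumptions that are not in the answer: normally distributed $X_i$ and errors, made-up values for $V$ and $\sigma^2$, and randomly chosen sample sizes $n_i$.

```r
# Sketch: check Var(grand mean) = V * sum(n_i^2) / n^2 + sigma^2 / n.
# Assumptions: normal X_i and normal errors; V, sigma2 and n_i are made up.
set.seed(1)
K      <- 100                              # number of experiments
n_i    <- sample(2:20, K, replace = TRUE)  # hypothetical sample sizes
n      <- sum(n_i)
V      <- 4                                # variance of X
sigma2 <- 9                                # variance of the errors eps_ij

grand_mean <- replicate(5000, {
  X_i <- rnorm(K, mean = 0, sd = sqrt(V))
  # The mean of n_i observations X_i + eps_ij is X_i plus noise with
  # variance sigma2 / n_i (exact for normal errors)
  xbar_i <- X_i + rnorm(K, mean = 0, sd = sqrt(sigma2 / n_i))
  sum(n_i * xbar_i) / n
})

theoretical <- V * sum(n_i^2) / n^2 + sigma2 / n
c(simulated = var(grand_mean), theoretical = theoretical)
```

The two numbers should agree to within a few percent; a larger number of replications tightens the match.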

Solved – Estimating population variance through simulation in R

Your code has some errors. I increased the number of iterations in your simulation (to 10k) to get a better approximation of where the simulated distribution is centered. The biggest problem is that in your code, you generate two different sets of data. One set is used for the data, and the other set is used to calculate the mean. To further deepen your understanding, you may also want to try using the known population mean and not using Bessel's correction to estimate the variance. Here is some code:

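set.seed(200)
B  <- 10000   # number of simulated samples
n  <- 5       # observations per sample
mu <- 5       # known population mean of Exp(rate = 0.2), i.e. 1 / 0.2
varianceN  <- numeric(B)  # sample mean, divide by n
varianceN1 <- numeric(B)  # sample mean, divide by n - 1 (Bessel's correction)
varianceP  <- numeric(B)  # known population mean, divide by n
for (i in seq_len(B)) {
  # Use the same data for the mean and for the squared deviations
  data          <- rexp(n, 0.2)
  varianceN[i]  <- sum((data - mean(data))^2) / n
  varianceN1[i] <- sum((data - mean(data))^2) / (n - 1)
  varianceP[i]  <- sum((data - mu)^2)         / n
}
# N.b., the theoretically correct value for the population variance is 1 / 0.2^2 = 25
mean(varianceN)   # [1] 19.85737
mean(varianceN1)  # [1] 24.82172
mean(varianceP)   # [1] 24.85525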
