Solved – Estimating population variance through simulation in R

estimationrsimulationvariance

I want to estimate the variance of the exponential distribution with a rate of $\lambda=0.2$.

I'm drawing a sample of 5 exponentials 1000 times, and know that the theoretical variance of my 'population' should be:
$$
\sigma^2=\frac{1}{\lambda^2}=\frac{1}{0.2^2} = 25
$$
From my understanding, dividing the deviations from the mean by the sample size will give a biased estimate of the population variance, even if the calculation will technically provide the variance of the individual sample. Instead, I should be able to get an unbiased estimate of the variance by accounting for the n-1 degrees of freedom:
$$
s^2 = \sum\limits_{i=1}^n \frac{(x_i-\bar{x})^2}{N-1}
$$
I now want to explore this bias in R by running two simulations, one accounting for the reduced degrees of freedom and another in which the sum is only divided by $N$.

variance <- NULL
for (i in 1:1000) {
    variance <- c(variance, sum((rexp(5, 0.2) - mean(rexp(5, 0.2)))^2)/(5-1))
}
mean(variance)

However, my supposedly unbiased estimator, accounting for the fewer degrees of freedom, ends up being consistently more biased than the uncorrected version, which is expected to be biased.

For example with set.seed(200), I'm getting the following estimates of the population variance:

Dividing by n-1: 35.64
Dividing by n: 28.51

By comparison, I get much better results using R's built-in var() function. However, I would like to write out as many of the calculations by hand as possible, to help improve my understanding.

Am I misunderstanding the theory or am I misunderstanding the functions in my R code (or both)?

Best Answer

Your code has some errors. I increased the number of iterations in your simulation (to 10k) to get a better approximation of where the simulated distribution is centered. The biggest problem is that in your code, you generate two different sets of data. One set is used for the data, and the other set is used to calculate the mean. To further deepen your understanding, you may also want to try using the known population mean and not using Bessel's correction to estimate the variance. Here is some code:

set.seed(200)
B          <- 10000
varianceN  <- vector(length=B)
varianceN1 <- vector(length=B)
varianceP  <- vector(length=B)
for (i in 1:B) {
  data          <- rexp(5, 0.2)
  varianceN[i]  <- sum((data-mean(data))^2) / (5)
  varianceN1[i] <- sum((data-mean(data))^2) / (5-1)
  varianceP[i]  <- sum((data-5)^2)          / (5)
}
# N.b., the theoretically correct value for the population variance is 25
mean(varianceN)   # [1] 19.85737
mean(varianceN1)  # [1] 24.82172
mean(varianceP)   # [1] 24.85525
Related Question