Calculating Required Sample Size – Precision of Variance Estimate and Techniques

Background

I have a variable with an unknown distribution.

I have 500 samples, but I would like to demonstrate the precision with which I can calculate the variance, e.g. to argue that a sample size of 500 is sufficient. I am also interested in knowing the minimum sample size that would be required to estimate the variance with a precision of $X\%$.

Questions

How can I calculate:

  1. the precision of my estimate of the variance given a sample size of $n=500$, or more generally $n=N$?
  2. the minimum number of samples required to estimate the variance with a precision of $X\%$?

Example

Figure 1: Density estimate of the parameter based on the 500 samples.

Figure 2: A plot of sample size (x-axis) vs. estimates of variance (y-axis), calculated using subsamples from the sample of 500. The idea is that the estimates will converge to the true variance as $n$ increases.

However, the estimates are not independent, since the samples used to estimate variance for $n \in \{10, 125, 250, 500\}$ are not independent of each other or of the samples used to calculate variance at $n \in \{20, 40, 80\}$.

Best Answer

For i.i.d. random variables $X_1, \dotsc, X_n$, the unbiased estimator $s^2$ of the variance (the one with denominator $n-1$) has variance:

$$\mathrm{Var}(s^2) = \sigma^4 \left(\frac{2}{n-1} + \frac{\kappa}{n}\right)$$

where $\kappa$ is the excess kurtosis of the distribution (reference: Wikipedia). So now you need to estimate the kurtosis of your distribution as well. You can use a quantity sometimes described as $\gamma_2$ (also from Wikipedia):

$$\gamma_2 = \frac{\mu_4}{\sigma^4} - 3$$
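To get a feel for what the variance formula implies, one option is to read "precision" as the relative standard error of $s^2$ (that reading is an assumption here, since the question does not define the term). Dividing the standard deviation of $s^2$ by $\sigma^2$ gives:

$$\frac{\sqrt{\mathrm{Var}(s^2)}}{\sigma^2} = \sqrt{\frac{2}{n-1} + \frac{\kappa}{n}}$$

As a rough worked case, for a normal distribution ($\kappa = 0$) and $n = 500$ this is $\sqrt{2/499} \approx 0.063$, i.e. about 6.3%. Heavier tails ($\kappa > 0$) make it larger. The minimum sample size for a target relative precision $X$ is then the smallest $n$ with $\frac{2}{n-1} + \frac{\kappa}{n} \le X^2$, which is roughly $n \approx (2 + \kappa)/X^2$ for large $n$.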

I would assume that if you use $s$ as an estimate for $\sigma$ and the sample value of $\gamma_2$ as an estimate for $\kappa$, you get a reasonable estimate of $\mathrm{Var}(s^2)$, although I don't see a guarantee that it is unbiased. Check whether it matches the variance among the subsets of your 500 data points reasonably well, and if it does, don't worry about it anymore :)
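A minimal Python sketch of that plug-in approach might look like the following. The function names (`excess_kurtosis`, `relative_se_of_s2`, `min_sample_size`) are made up for illustration, "precision" is taken to mean the relative standard error of $s^2$ (an assumption, not something the question pins down), and the simulated normal data merely stands in for the real 500 observations.

```python
import numpy as np


def excess_kurtosis(x):
    """Plug-in estimate of gamma_2 = mu_4 / sigma^4 - 3 from sample moments."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (biased variance)
    m4 = np.mean(centered ** 4)  # fourth central moment
    return m4 / m2 ** 2 - 3.0


def relative_variance_of_s2(n, kappa):
    """Var(s^2) / sigma^4 = 2/(n-1) + kappa/n, per the formula in the answer."""
    return 2.0 / (n - 1) + kappa / n


def relative_se_of_s2(n, kappa):
    """Standard deviation of s^2 divided by sigma^2 (assumed meaning of 'precision')."""
    return float(np.sqrt(relative_variance_of_s2(n, kappa)))


def min_sample_size(target_rel_se, kappa):
    """Smallest n whose relative standard error of s^2 is <= target_rel_se."""
    limit = target_rel_se ** 2
    # 2/(n-1) + kappa/n > (2 + kappa)/n, so no n below (2 + kappa)/limit
    # can qualify; start the search there and walk upwards.
    n = max(2, int((2.0 + kappa) / limit))
    while relative_variance_of_s2(n, kappa) > limit:
        n += 1
    return n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal(500)  # hypothetical stand-in for the real sample
    kappa_hat = excess_kurtosis(data)
    s2 = data.var(ddof=1)  # unbiased variance estimate (denominator n - 1)
    var_s2_hat = s2 ** 2 * relative_variance_of_s2(len(data), kappa_hat)
    print("estimated Var(s^2):", var_s2_hat)
    print("relative SE at n=500:", relative_se_of_s2(len(data), kappa_hat))
    print("n needed for 5% relative SE:", min_sample_size(0.05, kappa_hat))
```

Because the kurtosis estimate is itself noisy, especially for heavy-tailed data, the resulting sample size is probably best treated as a ballpark figure rather than a guarantee.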