Bounds for the population variance


Suppose we have i.i.d. samples $x_1$, $\ldots$, $x_n$ of a (potentially non-normal) random variable $X$ with finite moments. We can use these samples to construct unbiased estimates of the population mean and population variance
$$
\bar{x} = n^{-1} \sum_{i=1}^n x_i \qquad\text{and}\qquad s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \enspace.
$$
Without making any assumptions about the distribution of $X$, it is possible to construct probabilistic bounds on the population mean by using Chebyshev's inequality (see, e.g., Wikipedia or the original paper).
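For concreteness, the kind of bound meant here can be written as follows. This is only a sketch, and it assumes that $\sigma$ is known (or replaced by an upper bound), which the statement above does not spell out. Applying Chebyshev's inequality to $\bar{x}$, whose variance is $\sigma^2/n$, gives for any $k>0$
$$
P\left(|\bar{x} - \mu| \geq k\,\frac{\sigma}{\sqrt{n}}\right) \leq \frac{1}{k^2} \enspace,
$$
so that with probability at least $1 - k^{-2}$ the population mean $\mu$ lies in $[\bar{x} - k\sigma/\sqrt{n},\ \bar{x} + k\sigma/\sqrt{n}]$.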

My question is: do such probabilistic bounds exist for the population variance? In other words, can we say that with probability $\delta$ the population variance $\sigma^2$ will be in some interval $[L(\delta,\{x_i\}),U(\delta,\{x_i\})]$? And if so, what are the functions $L$ and $U$ that describe the lower and upper bound?

For normal distributions the sample variance follows a $\sigma^2 \chi^2_{n-1} (n-1)^{-1}$ distribution. This can be used to construct confidence intervals. However, I am looking for more general bounds that apply also to non-normal settings.
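To make the normal-theory case explicit (a standard result, stated here in my own notation): writing $\chi^2_{n-1,p}$ for the $p$-quantile of the $\chi^2_{n-1}$ distribution, an equal-tailed interval with coverage $\delta$ is
$$
\frac{(n-1)\,s^2}{\chi^2_{n-1,(1+\delta)/2}} \;\leq\; \sigma^2 \;\leq\; \frac{(n-1)\,s^2}{\chi^2_{n-1,(1-\delta)/2}} \enspace.
$$
This interval relies on normality, which is exactly the assumption I would like to drop.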

Best Answer

The general result for the asymptotic distribution of the sample variance is (see this post)

$$\sqrt n(\hat v - v) \xrightarrow{d} N\left(0,\mu_4 - v^2\right)$$

where here, I have used the notation $v\equiv \sigma^2$ to avoid later confusion with squares, and where $\mu_4 = \mathrm{E}\left((X_i -\mu)^4\right)$. Therefore by the continuous mapping theorem

$$\frac {n(\hat v - v)^2}{\mu_4 - v^2} \xrightarrow{d} \chi^2_1 $$

Then, accepting the approximation,

$$P\left(\frac {n(\hat v - v)^2}{\mu_4 - v^2}\leq \chi^2_{1,1-a}\right)=1-a$$

The inequality inside the parentheses is quadratic in $v$ and involves the unknown term $\mu_4$. Accepting a further approximation, we can estimate $\mu_4$ from the sample. Then we will obtain

$$P\left(Av^2 + Bv +\Gamma\leq 0 \right)=1-a$$

The roots of the polynomial are

$$v^*_{1,2}= \frac {-B \pm \sqrt {B^2 -4A\Gamma}}{2A}$$

and our $1-a$ confidence interval for the population variance will be

$$\max\Big\{0,\min\{v^*_{1,2}\}\Big\}\leq \sigma^2 \leq \max\{v^*_{1,2}\}$$

since the probability that the quadratic polynomial is at most zero equals (in our case, where $A>0$) the probability that the population variance lies between the roots of the polynomial.


Monte Carlo Study

For clarity, denote $\chi^2_{1,1-a}\equiv z$.

A little algebra gives us that

$$A = n+z, \quad B = -2n\hat v, \quad \Gamma = n\hat v^2 - z \hat \mu_4$$

which leads to

$$v^*_{1,2}= \frac {n\hat v \pm \sqrt {nz(\hat \mu_4-\hat v^2)+z^2\hat \mu_4}}{n+z}$$
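The algebra behind these expressions can be spelled out as follows, with $\hat \mu_4$ substituted for $\mu_4$ in the inequality $n(\hat v - v)^2 \leq z(\mu_4 - v^2)$:

$$\begin{aligned}
n\hat v^2 - 2n\hat v\, v + n v^2 &\leq z\hat \mu_4 - z v^2 \\
(n+z)\,v^2 - 2n\hat v\, v + \left(n\hat v^2 - z\hat \mu_4\right) &\leq 0 \\
B^2 - 4A\Gamma = 4n^2\hat v^2 - 4(n+z)\left(n\hat v^2 - z\hat \mu_4\right) &= 4\left[nz\left(\hat \mu_4 - \hat v^2\right) + z^2\hat \mu_4\right]
\end{aligned}$$

Dividing $-B \pm \sqrt{B^2 - 4A\Gamma}$ by $2A = 2(n+z)$ then gives the roots above.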

For $a=0.05$ we have $\chi^2_{1,1-a}\equiv z \approx 3.84$.

I generated $10{,}000$ samples, each of size $n=100$, from a Gamma distribution with shape parameter $k=3$ and scale parameter $\theta = 2$. The true mean is $\mu = 6$, and the true variance is $v=\sigma^2 =12$.

Results:
The sampling distribution of the sample variance was still far from normal, but this is to be expected for the small sample size chosen. Its average value, though, was $11.88$, pretty close to the true value.

The upper bound was smaller than the true variance in $1{,}456$ samples, while the lower bound was greater than the true variance only $17$ times. So the true value was missed by the CI in $14.73$% of the samples, mostly due to undershooting, giving an empirical coverage of about $85$%, which is roughly $10$ percentage points below the nominal confidence level of $95$%.

On average the lower bound was $7.20$, while on average the upper bound was $15.68$. The average length of the CI was $8.47$. Its minimum length was $2.56$ while its maximum length was $34.52$.
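An experiment of this kind can be reproduced roughly as follows. This is a sketch under assumptions rather than the original code: the function names, the seed, and the choice of estimators (unbiased sample variance, plug-in fourth central moment) are mine, so the counts will not match the figures above exactly.

```python
import random
from math import sqrt
from statistics import NormalDist


def variance_ci(sample, z):
    """Roots of the quadratic in v, with the lower root truncated at zero."""
    n = len(sample)
    mean = sum(sample) / n
    v_hat = sum((x - mean) ** 2 for x in sample) / (n - 1)  # assumed estimator
    mu4_hat = sum((x - mean) ** 4 for x in sample) / n      # assumed estimator
    root = sqrt(max(n * z * (mu4_hat - v_hat ** 2) + z ** 2 * mu4_hat, 0.0))
    return max(0.0, (n * v_hat - root) / (n + z)), (n * v_hat + root) / (n + z)


def coverage_study(reps=10_000, n=100, shape=3.0, scale=2.0, alpha=0.05, seed=0):
    """Return (empirical coverage, # intervals below truth, # intervals above truth)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # chi^2_1 quantile at 1 - alpha
    true_var = shape * scale ** 2                 # Gamma variance k * theta^2
    below = above = 0
    for _ in range(reps):
        # random.gammavariate takes (shape, scale) as its two arguments.
        sample = [rng.gammavariate(shape, scale) for _ in range(n)]
        lower, upper = variance_ci(sample, z)
        if upper < true_var:
            below += 1   # whole interval lies below the true variance
        elif lower > true_var:
            above += 1   # whole interval lies above the true variance
    return 1 - (below + above) / reps, below, above


if __name__ == "__main__":
    print(coverage_study())
```

Varying `n` in `coverage_study` is a simple way to see how quickly the empirical coverage approaches the nominal level as the sample size grows.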
