Apart from the fact that it should be $m-1$ instead of $n-1$ in the right-hand denominator, your estimator for $\sigma^2$ looks fine. You can do slightly better on the variance of $\hat\mu$ (though the question didn't ask to optimize it): Consider a general convex combination
$$
\alpha\frac{X_1+\dotso+X_n}n+(1-\alpha)\frac{Y_1+\dotso+Y_m}{2m}
$$
of the individual estimators for $\mu$. The variance of this combined estimator is
$$
n\left(\frac\alpha n\right)^2\sigma^2+m\left(\frac{1-\alpha}{2m}\right)^2\sigma^2=\left(\frac{\alpha^2}n+\frac{(1-\alpha)^2}{4m}\right)\sigma^2\;,
$$
and minimizing this by setting the derivative with respect to $\alpha$ to zero leads to $\alpha=n/(n+4m)$, yielding the variance $\sigma^2/(n+4m)$. For $n=m$ the variance is $\frac15\sigma^2/n=0.2\sigma^2/n$, compared to $\frac5{16}\sigma^2/n\approx0.3\sigma^2/n$ for your estimator, and for $n$ fixed and $m\to\infty$ or vice versa, the variance of this estimator tends to zero whereas the variance of your estimator tends to a non-zero value.
You could optimize the variance of your unbiased variance estimator in a similar way, though the calculation would be a bit more involved.
First, your notation for the sample variance seems to be muddled. The sample variance is ordinarily defined as $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2,$ which makes it an unbiased estimator of the population variance $\sigma^2.$
Perhaps the most common context for 'unbiased pooled estimator' of variance is for the 'pooled t test': Suppose you have two random samples $X_i$ of size $n$ and $Y_i$ of size $m$ from populations with the same variance $\sigma^2.$ Then
the pooled estimator of $\sigma^2$ is
$$S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{m+n-2}.$$
This estimator is unbiased.
Because one says the samples have respective 'degrees of freedom' $n-1$ and $m-1,$ one sometimes says that $S_p^2$ is a 'degrees-of-freedom' weighted average of
the two sample variances. If $n = m,$ then $S_p^2 = 0.5S_X^2 + 0.5S_Y^2.$
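For concreteness, here is a small R sketch of this computation (sample sizes and means are made up; R's var() uses the $n-1$ denominator, so it matches $S_X^2$ and $S_Y^2$ above):
set.seed(101)
n = 8; m = 12; sigma = 3                           # illustrative values
x = rnorm(n, 0, sigma); y = rnorm(m, 10, sigma)    # different means, common variance
s.p2 = ((n-1)*var(x) + (m-1)*var(y))/(n + m - 2)   # pooled estimator S_p^2
s.p2                                               # one realization; E[S_p^2] = sigma^2 = 9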
Note: Some authors do define the sample variance as $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2,$ but then the sample variance is not an unbiased estimator of $\sigma^2,$ even though it might have other properties desirable for the author's task at hand. However, most agree that the notation $S^2$ is reserved for the version with $n-1$ in the denominator, unless a specific warning is given otherwise.
Example: One common measure of the 'goodness' of an estimator is that it have a small
'root mean squared error' (RMSE). If $T$ is an estimator of $\tau,$ then
$\text{MSE}_T(\tau) = E[(T-\tau)^2]$ and the RMSE is its square root.
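In R this amounts to a one-line helper (the name rmse is ours, for illustration only; the simulation below computes the same quantity inline):
rmse = function(t, tau) sqrt(mean((t - tau)^2))   # RMSE of estimates t for target tau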
The simulation below illustrates, for normal data with $n = 5$ and $\sigma^2 = 10^2 = 100,$ that
the version of the sample variance with $n$ in the denominator has smaller
RMSE than the version with $n-1$ in the denominator. (A formal proof for
$n > 1$ is not difficult.)
set.seed(1888); m = 10^6; n = 5; sigma = 10; sg.sq = 100   # here m = number of simulated samples
v.a = replicate(m, var(rnorm(n, 100, sigma)))  # denom n-1
v.b = (n-1)*v.a/n                              # denom n
mean(v.a); RMS.a = sqrt(mean((v.a-sg.sq)^2)); RMS.a
[1] 100.0564 # aprx 100: unbiased
[1] 70.81563 # larger RMSE
mean(v.b); RMS.b = sqrt(mean((v.b-sg.sq)^2)); RMS.b
[1] 80.0451  # biased low
[1] 60.06415 # smaller RMSE
Best Answer
Dividing by $n-1$ instead of $n$ is called Bessel's correction. You should only use Bessel's correction when you are also subtracting the sample mean. If you already know the population mean, so that the sample mean is unnecessary, then you shouldn't use the correction (see the Wikipedia page).
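A minimal simulation sketch of that point, assuming normal data with known mean $\mu$: centering at the known $\mu$ and dividing by $n$, with no correction, is already unbiased.
set.seed(42); reps = 10^5; n = 5; mu = 0; sigma = 10
v.known = replicate(reps, {x = rnorm(n, mu, sigma); sum((x - mu)^2)/n})  # denom n, known mean
mean(v.known)   # aprx sigma^2 = 100: unbiased with no correction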
In the case where you didn't know the population mean, you'd need to subtract the sample mean, which in the one-data-point case would be that single data point. You would then end up with $0/0$, which is undefined. That makes sense intuitively, since one point can't tell you anything about the variability of the population.
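R behaves accordingly: var(), which uses the $n-1$ denominator, returns NA for a single observation.
var(3.14)   # NA: sum of squared deviations 0, divided by n-1 = 0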