[Math] unbiased estimator of sample variance using two samples

probability, sampling, statistics

I have a couple of questions, and I'm hoping someone can help!
Let $X_1,\dots,X_n$ be a random i.i.d. sample from a $N(\mu,\sigma^2)$ distribution and $Y_1,\dots,Y_m$ be a random i.i.d. sample from a $N(2\mu,\sigma^2)$ distribution, and further let the two samples be independent (with the quantities $\mu$ and $\sigma^2$ unknown).
I'm trying to do the following: construct an unbiased estimator $\hat{\mu}$ of $\mu$ using both samples, calculate $Var(\hat{\mu})$, and then use both samples to obtain an unbiased estimator of $\sigma^2$.

I think I understand the first two parts: we know $E\left(\frac{X_1+\dots+X_n}{n}\right) = \mu$ and $E\left(\frac{Y_1+\dots+Y_m}{m}\right) = 2\mu$, so I believe $\frac{X_1+\dots+X_n}{2n}+\frac{Y_1+\dots+Y_m}{4m}$ should provide an unbiased estimator of $\mu$, and from that it follows that $Var(\hat{\mu})=\sigma^2\left(\frac{1}{4n}+\frac{1}{16m}\right)$.
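
As a sanity check, here is a quick Monte Carlo sketch of this estimator (the values of $\mu$, $\sigma$, $n$ and $m$ below are arbitrary choices, just for illustration):

```python
import random

# Arbitrary illustration values (not part of the problem statement)
mu, sigma, n, m, trials = 1.5, 2.0, 10, 7, 200_000

est = []
for _ in range(trials):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    y = [random.gauss(2 * mu, sigma) for _ in range(m)]
    # the proposed estimator: sum(X)/(2n) + sum(Y)/(4m)
    est.append(sum(x) / (2 * n) + sum(y) / (4 * m))

mean_est = sum(est) / trials
var_est = sum((e - mean_est) ** 2 for e in est) / (trials - 1)

print(mean_est, "vs", mu)  # empirical mean vs mu: should be close
print(var_est, "vs", sigma**2 * (1 / (4 * n) + 1 / (16 * m)))  # variance check
```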

What I'm not clear on is how to construct an unbiased estimator for the variance. I'm aware that $\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2$ provides an unbiased estimator of $\sigma^2$ (the proof is on Wikipedia). From this, it seems like $\frac{1}{2}\cdot\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2+\frac{1}{2}\cdot\frac{1}{n-1}\sum_{i=1}^m(Y_i-\bar{Y})^2$ should have expectation $\frac{\sigma^2}{2}+\frac{\sigma^2}{2}=\sigma^2$, but something about it makes me nervous, and I feel like this approach may be inherently flawed. Any help/suggestions would be greatly appreciated!
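
Here is a minimal simulation sketch I'd use to test the candidate exactly as written above (again with arbitrary parameter values, and deliberately with $n \ne m$):

```python
import random

# Arbitrary illustration values; note n != m on purpose
mu, sigma, n, m, trials = 1.5, 2.0, 10, 7, 200_000

def candidate(x, y):
    # combined estimator exactly as written above (n-1 in BOTH denominators)
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    sy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    return 0.5 * sx + 0.5 * sy

avg = sum(
    candidate([random.gauss(mu, sigma) for _ in range(n)],
              [random.gauss(2 * mu, sigma) for _ in range(m)])
    for _ in range(trials)
) / trials

print(avg, "vs", sigma**2)  # with n != m these disagree noticeably
```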
Thanks

Best Answer

Apart from the fact that it should be $m-1$ instead of $n-1$ in the right-hand denominator, your estimator for $\sigma^2$ looks fine. You can do slightly better on the variance of $\hat\mu$ (though the question didn't ask to optimize it): Consider a general convex combination

$$ \alpha\frac{X_1+\dotso+X_n}n+(1-\alpha)\frac{Y_1+\dotso+Y_m}{2m} $$

of the individual estimators for $\mu$. The variance of this combined estimator is

$$ n\left(\frac\alpha n\right)^2\sigma^2+m\left(\frac{1-\alpha}{2m}\right)^2\sigma^2=\left(\frac{\alpha^2}n+\frac{(1-\alpha)^2}{4m}\right)\sigma^2\;, $$

and minimizing this by setting the derivative with respect to $\alpha$ to zero, $\frac{2\alpha}n-\frac{1-\alpha}{2m}=0$, leads to $\alpha=n/(n+4m)$, yielding the variance $\sigma^2/(n+4m)$. For $n=m$ this is $\frac15\sigma^2/n=0.2\sigma^2/n$, compared to $\frac5{16}\sigma^2/n\approx0.3\sigma^2/n$ for your estimator; and for $n$ fixed and $m\to\infty$ (or vice versa), the variance of this estimator tends to zero, whereas the variance of your estimator tends to a non-zero value.
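
To illustrate the comparison numerically (the $(n,m)$ pairs below are arbitrary examples), one can tabulate both variance formulas divided by $\sigma^2$:

```python
# Var/sigma^2 of the two unbiased estimators of mu, from the formulas above
# (the (n, m) pairs are arbitrary examples).
for n, m in [(10, 10), (10, 100), (10, 10_000)]:
    yours = 1 / (4 * n) + 1 / (16 * m)  # the question's estimator
    best = 1 / (n + 4 * m)              # optimal convex combination
    print(n, m, round(yours, 5), round(best, 5))
```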

You could optimize the variance of your unbiased variance estimator in a similar way, though the calculation would be a bit more involved.
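
For instance, here is a sketch of that calculation in sympy, using the fact that $Var(S^2)=2\sigma^4/(k-1)$ for the usual unbiased variance estimator from a normal sample of size $k$:

```python
from sympy import symbols, diff, solve, simplify

a, n, m, s4 = symbols('alpha n m sigma4', positive=True)

# Variance of alpha*S_X^2 + (1-alpha)*S_Y^2, using
# Var(S^2) = 2*sigma^4/(k-1) for a normal sample of size k
v = a**2 * 2 * s4 / (n - 1) + (1 - a)**2 * 2 * s4 / (m - 1)

a_opt = solve(diff(v, a), a)[0]
print(simplify(a_opt))             # (n - 1)/(n + m - 2)
print(simplify(v.subs(a, a_opt)))  # 2*sigma4/(n + m - 2)
```

The optimal weight $(n-1)/(n+m-2)$ recovers exactly the usual pooled variance estimator $\bigl((n-1)S_X^2+(m-1)S_Y^2\bigr)/(n+m-2)$.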
