[Math] unbiased estimate of the variance of a weighted mean


I am trying to track down a formula (and derivation) for an unbiased estimate of the variance of a weighted mean.

Wikipedia provides such a formula:

http://en.wikipedia.org/wiki/Weighted_mean

However, the reference they cite (the GNU Scientific Library implementation comments):

a. has no derivation
b. states that the implementation is designed for the case where each weight is a variance normalization, whereas the case I am interested in is where the weights are an arbitrary set of positive numbers that sum to 1.

Does anyone know of a derivation of the formula wikipedia gives, or related work?

Thanks,

SetJmp

Best Answer

First some notation. Each example is drawn from some unknown distribution $Y$ with $E[Y] = \mu$ and $\textrm{Var}[Y] = \sigma^2$. Suppose the weighted mean consists of $n$ independent draws $X_i\sim Y$, and $\{w_i\}_1^n$ is in the standard simplex. Finally define the r.v. $X = \sum_i w_i X_i$. Note that $E[X] = \sum_i w_i E[X_i] = \mu$ and $\textrm{Var}[X] = \sum_i w_i^2 \textrm{Var} [X_i] = \sigma^2\sum_i w_i^2$.
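
To make the setup concrete, here is a minimal simulation sketch (the distribution, constants, and variable names are my own choices, not from the question): it draws many independent $n$-samples, forms $X = \sum_i w_i X_i$, and compares the empirical mean and variance of $X$ against $\mu$ and $\sigma^2\sum_i w_i^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 3.0, 2.0, 5, 200_000

w = rng.random(n)
w /= w.sum()                                     # weights on the standard simplex

draws = rng.normal(mu, sigma, size=(trials, n))  # each row: one n-sample of Y
X = draws @ w                                    # one realization of X per row

print(X.mean())                                  # ~ mu
print(X.var(), sigma**2 * (w**2).sum())          # both ~ Var[X]
```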

Generalizing the standard definition of the sample mean, take $$ \hat \mu(\{x_i\}_1^n) := \sum_i w_i x_i. $$ Note that $E[\hat \mu(\{x_i\}_1^n)] = \sum_i w_i E[x_i] = \mu = E[X]$, so $\hat \mu$ is an unbiased estimator.
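
As a quick check of that unbiasedness claim, a small sketch (again with arbitrary, self-chosen constants): averaging $\hat\mu$ over many independent $n$-samples should recover $\mu$.

```python
import numpy as np

def weighted_mean(x, w):
    """hat_mu({x_i}) = sum_i w_i x_i, for weights w on the standard simplex."""
    return np.dot(w, x)

rng = np.random.default_rng(1)
w = np.array([0.5, 0.3, 0.2])
mu_hats = [weighted_mean(rng.normal(3.0, 2.0, size=3), w) for _ in range(100_000)]
print(np.mean(mu_hats))  # ~ 3.0 = mu
```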

For the sample variance, generalize similarly:
$$ \hat \sigma^2_b(\{x_i\}_1^n) := \sum_i w_i (x_i - \hat \mu(\{x_i\}_1^n))^2, $$
where the subscript foreshadows that this will need a correction to be unbiased. Since $\sum_j w_j = 1$, we can write $x_i - \hat\mu = \sum_j w_j(x_i - x_j)$, so
$$ E[\hat \sigma^2_b] = \sum_i w_i E[(x_i - \hat \mu)^2] = \sum_i w_i E\left[\left(\sum_j w_j (x_i - x_j)\right)^2\right]. $$
The term inside the expectation can be written as
$$ \sum_{j,k} w_j(x_i - x_j)w_k(x_i - x_k) = \sum_jw_j^2(x_i - x_j)^2 + \sum_{j\neq k} w_j w_k(x_i - x_j)(x_i - x_k). $$
Passing the expectation inside, each summand of the first sum with $j \neq i$ gives
$$ E[(x_i-x_j)^2] = 2E[x_i^2] - 2\mu^2 = 2\sigma^2 $$
(the $j = i$ summand vanishes), whereas each summand of the second sum with $j, k \neq i$ gives
$$ E[x_i^2 - x_ix_j - x_ix_k + x_jx_k] = E[x_i^2] - \mu^2 = \sigma^2 $$
(any summand with $j = i$ or $k = i$ vanishes). Combining everything,
$$ E[\hat \sigma^2_b] = \sum_i w_i \left(2\sigma^2\sum_{j\neq i}w_j^2 + \sigma^2\sum_{\substack{j\neq k \\ j,k\neq i}} w_j w_k\right) = \sigma^2\left( 1 - \sum_j w_j^2\right). $$
Therefore $E[\hat \sigma_b^2] - \sigma^2 = -\sigma^2\sum_j w_j^2$, i.e. this is a biased estimator. To make it an unbiased estimator of $\sigma^2 = \textrm{Var}[Y]$, divide by the factor $1 - \sum_j w_j^2$:
$$ \hat \sigma_u^2(\{x_i\}_1^n) := \frac {\hat \sigma_b^2(\{x_i\}_1^n)}{1- \sum_j w_j^2} = \frac {\sum_i w_i(x_i - \hat \mu)^2}{1- \sum_j w_j^2 }. $$
This matches the definition you gave (and as a sanity check, setting $w_i = 1/n$ recovers the usual unbiased estimate with its $1/(n-1)$ factor).
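
Here is a direct implementation of $\hat\sigma_u^2$ with a Monte Carlo bias check; this is a sketch under the assumptions above, with arbitrary constants and my own function naming.

```python
import numpy as np

def weighted_var_unbiased(x, w):
    """hat_sigma_u^2 = sum_i w_i (x_i - hat_mu)^2 / (1 - sum_j w_j^2),
    the unbiased estimator of Var[Y] derived above."""
    mu_hat = np.dot(w, x)
    biased = np.dot(w, (x - mu_hat) ** 2)   # hat_sigma_b^2
    return biased / (1.0 - np.dot(w, w))

rng = np.random.default_rng(2)
w = np.array([0.4, 0.3, 0.2, 0.1])
est = [weighted_var_unbiased(rng.normal(0.0, 2.0, size=4), w)
       for _ in range(200_000)]
print(np.mean(est))  # ~ sigma^2 = 4.0

# Sanity check: uniform weights recover the usual 1/(n-1) estimator.
x = rng.normal(size=6)
print(weighted_var_unbiased(x, np.full(6, 1/6)), np.var(x, ddof=1))
```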

Now, if one instead were to seek an unbiased estimator of $\textrm{Var}[X]$ with $X=\sum_i w_i X_i$ (i.e. the variance of the weighted mean itself), the formula would instead be $\hat \sigma_b^2(\{x_i\}_1^n)(\sum_j w_j^2) / ( 1 - \sum_j w_j^2)$, since $\textrm{Var}[X] = \sigma^2\sum_j w_j^2$.
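
In code, this sketch differs from the previous one only in the scale factor (again my own naming, under the same assumed setup):

```python
import numpy as np

def weighted_mean_var_unbiased(x, w):
    """Unbiased estimate of Var[X] for X = sum_i w_i X_i:
    hat_sigma_b^2 * (sum_j w_j^2) / (1 - sum_j w_j^2)."""
    mu_hat = np.dot(w, x)
    biased = np.dot(w, (x - mu_hat) ** 2)   # hat_sigma_b^2
    s2 = np.dot(w, w)                       # sum_j w_j^2
    return biased * s2 / (1.0 - s2)

rng = np.random.default_rng(3)
w = np.array([0.4, 0.3, 0.2, 0.1])
est = [weighted_mean_var_unbiased(rng.normal(0.0, 2.0, size=4), w)
       for _ in range(200_000)]
print(np.mean(est), 4.0 * np.dot(w, w))  # both ~ Var[X]
```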

It is very odd to me that the documents you refer to construct estimators of $\textrm{Var}[Y]$ rather than $\textrm{Var}[X]$; I don't see the justification for such an estimator. It is also not clear how to extend it to samples that don't have length $n$, whereas for the estimator of $\textrm{Var}[X]$ you simply take some number $m$ of $n$-samples, and averaging everything above works out. Also, I didn't check, but my suspicion is that the weighted estimator of $\textrm{Var}[Y]$ has higher variance than the usual unweighted one; if so, why use the weighted estimator at all? Building an estimator for $\textrm{Var}[X]$ would seem to have been the intent.