Solved – Weighted standard deviation of average

rstandard deviationweighted mean

For two groups (a and b) with unequal numbers of observations (n=10 and n=30), I have given their means and sd. Calculating the weighted mean is pretty straight forward. But how can I derive the sd of this weighted mean? Please consider the following dummy data set:

a <- rnorm(10,1,1)
b <- rnorm(30,1,1)
weight_a <- 1/4
weight_b <- 3/4

# Calculate weighted mean
weighted_mean <- mean(a)*weight_a + mean(b)*weight_b
# this gives the same average as:
mean(c(a,b))

# Calculate weighted sd
weighted_sd <- sqrt(sd(a)^2*weight_a+sd(b)^2*weight_b)
# this does not give the same sd as:
sd(c(a,b))

What am I missing?

Best Answer

You're confusing addition of random variables with concatenation of samples (easy to do, it took me a while to realize why your code didn't work!). So for independent random variables $X$ and $Y$ you can write $\rm{Var}(aX+bY) = a^2\rm{Var}(X) + b^2\rm{Var}(Y)$, noting the coefficients are squared. This would apply (approximately) to samples as follows

set.seed(1)
n <- 100
mu = 1
sigma = 1
a<-rnorm(n,mu,sigma)
b<-rnorm(n,mu,sigma)
weight_a <- 1/4
weight_b <- 3/4
sqrt(sd(a)^2*weight_a^2+sd(b)^2*weight_b^2)
[1] 0.7526849
sd(weight_a*a + weight_b*b)
[1] 0.7524718

For concatenation of samples you need to apply the formula for variance $\rm{Var}(X) = E(X^2) - (E(X))^2$, adjusted appropriately for the fact that you're using samples, i.e. multiply by $n/(n-1)$. This is illustrated with R code below.

suma2 = sum(a^2)
sumb2 = sum(b^2)
suma = sum(a)
sumb = sum(b)
sumc2 = suma2 + sumb2
sumc = suma + sumb
nc = n + n
(sumb2/n - (sumb/n)^2) * n/(n-1)
[1] 0.9175323
sd(b)^2
[1] 0.9175323
sumc2 = suma2 + sumb2
sumc = suma + sumb
(sumc2/nc - (sumc/nc)^2) * nc/(nc-1)
[1] 0.8632217
sd(c(a,b))^2
[1] 0.8632217

When you are concatenating weighted series, the weights can be incorporated naturally to give the formula.

$$ \rm{Var}(\mathbf{w}) = \left(\frac{w_a^2 S_{aa} + w_b^2 S_{bb}}{n_a + n_b} - \left(\frac{w_a S_a + w_b S_b}{n_a + n_b}\right)^2 \right) \times \frac{n_a + n_b}{n_a + n_b -1}, $$

where $\mathbf{w}$ is the series formed by concatenating series $\mathbf{a}$ of length $n_a$ weighted by $w_a$ and series $\mathbf{b}$ of length $n_b$ weighted by $w_b$. $S_{aa}$ is the sum of squares of series $\mathbf{a}$, $S_{a}$ is the sum of series $\mathbf{a}$, and likewise for $\mathbf{b}$ .

This is illustrated in the following R code.

sumw2 = weight_a^2*suma2 + weight_b^2*sumb2
sumw = weight_a*suma + weight_b*sumb
(sumw2/nc - (sumw/nc)^2) * nc/(nc-1)
[1] 0.3314697
sd(c(weight_a*a, weight_b*b))^2
[1] 0.3314697

Note that in the first case, addition, I had to use the same size of series. For convenience, I carried on with the same series. But the code used for the second case, concatenation, should work for different-sized series.

However, as @Peter Ellis states in his answer, the problem you are ultimately trying to solve may be something like a two-sample t test, where you assume that the two samples are from the same population, and so the underlying variance of each sample is the same. In this case, you want to estimate the population variance. The formula for the pooled variance is well-known and can be found on wikipedia (wikipedia also provides a generalization to multiple samples).