I have some distribution from which I sample two datasets, `x1` and `x2`. I wanted to calculate their combined mean and variance using these two formulas:
$$\overline{X}_c = \frac{n_1 \overline{X}_1 + n_2 \overline{X}_2}{n_1 + n_2}$$
$${S_c}^2 = \frac{n_1 {S_1}^2 + n_2 {S_2}^2 + n_1 \left(\overline{X}_1 - \overline{X}_c\right)^2 + n_2 \left(\overline{X}_2 - \overline{X}_c\right)^2}{n_1 + n_2}$$
where $n_i$ is the number of samples in dataset $i$, and $\overline{X}_i$ and ${S_i}^2$ are its mean and variance. The subscript $c$ indicates the combined values.
For testing purposes, I wanted to check whether the formulas yield the same result as stacking the two datasets into a single dataset $x3$ (all values of $x1$ followed by all values of $x2$) and calculating its mean and variance directly. So I created a dummy dataset like this:

| x1 | x2 |
|----|----|
| 98 | 69 |
| 49 | 54 |
| 33 | 38 |
| 73 | 9  |
| 51 |    |

I calculated the means and variances just for $x1$ and $x2$, and then for the 2 combination methods. It yielded:

|      | x1    | x2    | x3     | xC     |
|------|-------|-------|--------|--------|
| mean | 60.80 | 42.50 | 52.66  | 52.66  |
| var  | 635.2 | 659.0 | 657.75 | 728.47 |

As you can see, the formula works for the mean but fails to reproduce the variance of the stacked data (`x3`).
Can somebody tell me what I am doing wrong? Simple answers would be nice, as I am not a great mathematician.
Thank you!
Best Answer
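This formula holds for the uncorrected sample variances, i.e. the ones computed by dividing the sum of squared deviations by $n$. What you wrote in the row labeled `var` in the table is the estimate of the population variance based on the sample, which divides by $n - 1$ instead (Bessel's correction) and therefore differs from the uncorrected sample variance. Feeding the $n - 1$ values into a formula that expects the divide-by-$n$ values is what produces the mismatch.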
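
To make this concrete, here is a minimal sketch in Python that reruns the check on the dummy data from the question. It uses only the standard `statistics` module; the variable names (`between`, `var_c_n`, `var_c_bessel`, `var_c_mixed`) are my own and purely illustrative. It also tries a Bessel-corrected variant of the formula, which is not in the question and which I am assuming takes the form

$${S_c}^2 = \frac{(n_1 - 1){S_1}^2 + (n_2 - 1){S_2}^2 + n_1 \left(\overline{X}_1 - \overline{X}_c\right)^2 + n_2 \left(\overline{X}_2 - \overline{X}_c\right)^2}{n_1 + n_2 - 1}$$

when ${S_1}^2$ and ${S_2}^2$ are the $n - 1$ variances.

```python
import statistics

# Dummy data from the question; x3 is the stacked dataset.
x1 = [98.0, 49.0, 33.0, 73.0, 51.0]
x2 = [69.0, 54.0, 38.0, 9.0]
x3 = x1 + x2  # list concatenation, not element-wise addition

n1, n2 = len(x1), len(x2)
m1, m2 = statistics.mean(x1), statistics.mean(x2)

# Combined mean from the first formula.
mean_c = (n1 * m1 + n2 * m2) / (n1 + n2)

# Spread of the two group means around the combined mean.
between = n1 * (m1 - mean_c) ** 2 + n2 * (m2 - mean_c) ** 2

# 1) Formula as written, fed with divide-by-n variances (pvariance).
var_c_n = (n1 * statistics.pvariance(x1)
           + n2 * statistics.pvariance(x2)
           + between) / (n1 + n2)

# 2) Assumed Bessel-corrected variant, fed with divide-by-(n-1) variances.
var_c_bessel = ((n1 - 1) * statistics.variance(x1)
                + (n2 - 1) * statistics.variance(x2)
                + between) / (n1 + n2 - 1)

# 3) What happened in the question: (n-1) variances in the divide-by-n formula.
var_c_mixed = (n1 * statistics.variance(x1)
               + n2 * statistics.variance(x2)
               + between) / (n1 + n2)

print(round(var_c_n, 2), round(statistics.pvariance(x3), 2))      # both about 584.67
print(round(var_c_bessel, 2), round(statistics.variance(x3), 2))  # both 657.75
print(round(var_c_mixed, 2))                                      # 728.47, the xC value
```

So either compute every variance with the divide-by-$n$ convention and use the formula unchanged, or keep the $n - 1$ variances and adjust the weights and the denominator as in the second variant. Both agree with the directly computed variance of `x3` under the matching convention.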