[Math] Formula of combined variance of two data sets yields wrong output

means variance

I have some distribution from which I sample two datasets x1 and x2. I wanted to calculate their combined mean and variance by using these two formulas:

$$\overline{X}_c = \frac{n_1 \overline{X}_1 + n_2 \overline{X}_2}{n_1 + n_2}$$

$$S_c^2 = \frac{n_1 S_1^2 + n_2 S_2^2 + n_1 \left( \overline{X}_1 - \overline{X}_c \right)^2 + n_2 \left( \overline{X}_2 - \overline{X}_c \right)^2}{n_1 + n_2}$$

where $n_1$ and $n_2$ are the numbers of observations in the two datasets, and the subscript $c$ indicates the combined values.
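As a minimal sketch, the two formulas can be written as one Python function. The name `combine` and its argument names are illustrative only; the function applies the formulas to whatever means and variances are passed in, with the combined mean taken as the size-weighted average of the two means.

```python
def combine(n1, mean1, var1, n2, mean2, var2):
    """Return (combined mean, combined variance) from per-dataset summaries."""
    n = n1 + n2
    # Combined mean: average of the two means, weighted by dataset size
    mean_c = (n1 * mean1 + n2 * mean2) / n
    # Combined variance: weighted variances plus the spread of the two means
    # around the combined mean, all divided by the total size
    var_c = (
        n1 * var1 + n2 * var2
        + n1 * (mean1 - mean_c) ** 2
        + n2 * (mean2 - mean_c) ** 2
    ) / n
    return mean_c, var_c


# Summary values of x1 and x2 as reported in the question
print(combine(5, 60.80, 635.2, 4, 42.50, 659.0))  # roughly (52.67, 728.47), the xC column
```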

For testing purposes, I wanted to check whether the formulas yield the same result as stacking (concatenating) the two datasets into a single dataset $x3$ and calculating its mean and variance directly. So I created a dummy dataset like this:

x1    x2

98    69
49    54
33    38
73     9
51

I calculated the means and variances of $x1$ and $x2$ separately, and then the combined values in two ways: directly from the stacked data (x3) and with the formulas (xC). This yielded:

        x1      x2      x3      xC

mean    60.80   42.50   52.66   52.66

var     635.2   659.0   657.75  728.47

As you can see, the formula works for the mean, but it fails to reproduce the variance of the stacked data (x3).

Can somebody tell me what I am doing wrong? Simple answers would be nice, as I am not a great mathematician.

Thank you!

Best Answer

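This formula is for the variances of the samples themselves, i.e. computed with divisor $n$. What you wrote in the row labeled var in the table is the estimate of the population variance based on each sample, computed with divisor $n - 1$, which differs from the divisor-$n$ variance due to Bessel's correction. If you plug the divisor-$n$ variances of x1 and x2 into the formula, the result matches the divisor-$n$ variance of x3.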
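The point can be checked numerically on the data from the question. The sketch below uses Python's standard `statistics` module, where `pvariance` divides by $n$ and `variance` divides by $n - 1$ (Bessel's correction); the helper name `combined_variance` is illustrative and not part of any library.

```python
from statistics import mean, pvariance, variance

x1 = [98, 49, 33, 73, 51]
x2 = [69, 54, 38, 9]
x3 = x1 + x2  # list concatenation, i.e. the stacked dataset


def combined_variance(n1, m1, v1, n2, m2, v2):
    """Apply the combined-variance formula from the question."""
    mc = (n1 * m1 + n2 * m2) / (n1 + n2)
    return (
        n1 * v1 + n2 * v2 + n1 * (m1 - mc) ** 2 + n2 * (m2 - mc) ** 2
    ) / (n1 + n2)


n1, n2 = len(x1), len(x2)
m1, m2 = mean(x1), mean(x2)

# Divisor n throughout: the formula agrees with the stacked data
vc_n = combined_variance(n1, m1, pvariance(x1), n2, m2, pvariance(x2))
v3_n = pvariance(x3)
print(round(vc_n, 2), round(v3_n, 2))  # both about 584.67

# Divisor n - 1 throughout: reproduces the mismatch from the table
vc_n1 = combined_variance(n1, m1, variance(x1), n2, m2, variance(x2))
v3_n1 = variance(x3)
print(round(vc_n1, 2), round(v3_n1, 2))  # about 728.47 versus 657.75
```

If only the divisor-$(n-1)$ values are available, one option is to rescale each by $(n_i - 1)/n_i$ before applying the formula and then multiply the result by $(n_1 + n_2)/(n_1 + n_2 - 1)$; here that gives $584.67 \cdot 9/8 \approx 657.75$, the x3 value in the table.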
