Solved – How to combine sample means and sample variances

Tags: mean, normal distribution, sample, self-study, variance

Consider a set of data with values from 1 to 19, assumed to be normally distributed, and two subsets of that data: one with the numbers below 10, and one with the numbers above:

subset1 = [1 2 3 4 5 6 7 8 9 9 8 7 6 5 4 3 2 1];
subset2 = [15 16 17 18 19 19 18 17 16 15 15 16 17 18 19 18 17 16];
data    = [subset1 subset2];

We can calculate the means ($\mu_1,\mu_2,\mu_{tot}$) and standard deviations ($\sigma_1,\sigma_2,\sigma_{tot}$) of all three vectors above. If we assume the distributions to be Gaussian, we get the following graph:

[Figure: Gaussian distributions of the data sets]

But now let's assume the data set gets lost before I was able to calculate the total mean and variance (I know, it's ridiculous), so I am left with only $\mu_1,\mu_2$, $\sigma_1,\sigma_2$ and the lengths of the subsets $n_1,n_2$.

Could I reconstruct the total pdf with the parameters of the subsets?

The mean is rather easy: $\hat{\mu}=\frac{n_1\mu_1+n_2\mu_2}{n_1+n_2}=\mu_{tot}$
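As a quick sanity check of the weighted-mean formula, here is a minimal Python sketch using only the standard library. The vectors are the ones from the question; the helper name `combined_mean` is made up for this illustration.

```python
from statistics import fmean

subset1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 4, 3, 2, 1]
subset2 = [15, 16, 17, 18, 19, 19, 18, 17, 16, 15, 15, 16, 17, 18, 19, 18, 17, 16]


def combined_mean(n1, mu1, n2, mu2):
    """Size-weighted average of two subset means (hypothetical helper)."""
    return (n1 * mu1 + n2 * mu2) / (n1 + n2)


# Reconstructed from the subset summaries only.
mu_hat = combined_mean(len(subset1), fmean(subset1), len(subset2), fmean(subset2))

# Reference value computed from the full data set.
mu_tot = fmean(subset1 + subset2)

print(mu_hat, mu_tot)
```

Both values should agree up to floating-point rounding, since the weighted average simply regroups the sum over all data points.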

But what about the variance $\sigma_{tot}^2$? I read about pooled variance, but it does not depend on the means of the subsets and therefore cannot be the right quantity here, since the spacing between the two "subgaussians" obviously has an influence.

[EDIT]

As a matter of fact, there is a way! It is covered in this question:

How to calculate pooled variance of two groups given known group variances, means, and sample sizes?
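For the univariate case, the total variance can be recovered by averaging the second moments and subtracting the squared total mean: $\sigma_{tot}^2=\frac{n_1(\sigma_1^2+\mu_1^2)+n_2(\sigma_2^2+\mu_2^2)}{n_1+n_2}-\mu_{tot}^2$. This form assumes the variances were normalised by $n$. If they were normalised by $n-1$, the equivalent relation is $(n_1+n_2-1)s_{tot}^2=(n_1-1)s_1^2+(n_2-1)s_2^2+\frac{n_1n_2}{n_1+n_2}(\mu_1-\mu_2)^2$. The Python sketch below checks both conventions against the question's data; the function names are made up for this illustration.

```python
from statistics import fmean, pvariance, variance

subset1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 4, 3, 2, 1]
subset2 = [15, 16, 17, 18, 19, 19, 18, 17, 16, 15, 15, 16, 17, 18, 19, 18, 17, 16]
data = subset1 + subset2


def combine_population(n1, mu1, var1, n2, mu2, var2):
    """Combine variances that were normalised by n (hypothetical helper)."""
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    # Average the second moments E[x^2] = var + mu^2, then subtract mu_tot^2.
    second_moment = (n1 * (var1 + mu1 ** 2) + n2 * (var2 + mu2 ** 2)) / n
    return mu, second_moment - mu ** 2


def combine_sample(n1, mu1, s1_sq, n2, mu2, s2_sq):
    """Combine variances that were normalised by n - 1 (hypothetical helper)."""
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    within = (n1 - 1) * s1_sq + (n2 - 1) * s2_sq   # spread inside each subset
    between = n1 * n2 / n * (mu1 - mu2) ** 2       # spread between subset means
    return mu, (within + between) / (n - 1)


n1, n2 = len(subset1), len(subset2)
mu1, mu2 = fmean(subset1), fmean(subset2)

mu_pop, var_pop = combine_population(n1, mu1, pvariance(subset1),
                                     n2, mu2, pvariance(subset2))
mu_smp, var_smp = combine_sample(n1, mu1, variance(subset1),
                                 n2, mu2, variance(subset2))

# Compare with the values computed directly from the full data set.
print(var_pop, pvariance(data))
print(var_smp, variance(data))
```

The `between` term is exactly what the plain pooled variance leaves out, which is why the distance between the subset means matters here.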

I apologize. Follow-up question:

Does the same hold for multivariate pdfs?

Best Answer

For multivariate normal distributions one can derive it in the same way. Writing $m$ and $n$ for the subset sizes (the $n_1$ and $n_2$ of the question), one ends up with:

$ (m+n)(\Sigma_{tot}+\underline{\mu}_{tot}\underline{\mu}_{tot}^T) = m(\Sigma_1+\underline{\mu}_{1}\underline{\mu}_{1}^T)+n(\Sigma_2+\underline{\mu}_{2}\underline{\mu}_{2}^T)$

Rearranging for $\Sigma_{tot}$ gives the formula for the total covariance.
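Written out, the rearrangement is $\Sigma_{tot}=\frac{m(\Sigma_1+\underline{\mu}_1\underline{\mu}_1^T)+n(\Sigma_2+\underline{\mu}_2\underline{\mu}_2^T)}{m+n}-\underline{\mu}_{tot}\underline{\mu}_{tot}^T$, with $\underline{\mu}_{tot}=\frac{m\underline{\mu}_1+n\underline{\mu}_2}{m+n}$. The identity is exact when the covariances are normalised by the subset sizes $m$ and $n$ rather than by $m-1$ and $n-1$. Below is a minimal pure-Python sketch that checks it numerically; the 2-D points and all function names are made up for this illustration and do not come from the question.

```python
def mean_vector(points):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(points)
    return [sum(p[i] for p in points) / count for i in range(len(points[0]))]


def covariance(points):
    """Covariance matrix normalised by the number of points (not by n - 1)."""
    count = len(points)
    mu = mean_vector(points)
    dim = len(mu)
    return [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / count
             for j in range(dim)] for i in range(dim)]


def combine(m, mu1, cov1, n, mu2, cov2):
    """Total mean and covariance from subset sizes, means and covariances."""
    dim = len(mu1)
    mu = [(m * mu1[i] + n * mu2[i]) / (m + n) for i in range(dim)]
    # Entry (i, j): weighted average of (Sigma_k + mu_k mu_k^T), minus mu mu^T.
    cov = [[(m * (cov1[i][j] + mu1[i] * mu1[j])
             + n * (cov2[i][j] + mu2[i] * mu2[j])) / (m + n)
            - mu[i] * mu[j]
            for j in range(dim)] for i in range(dim)]
    return mu, cov


# Hypothetical 2-D samples, invented for this check only.
group1 = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
group2 = [(10.0, 8.0), (12.0, 9.0), (11.0, 12.0)]

mu_hat, cov_hat = combine(len(group1), mean_vector(group1), covariance(group1),
                          len(group2), mean_vector(group2), covariance(group2))

# Reference values computed directly from all points.
mu_tot = mean_vector(group1 + group2)
cov_tot = covariance(group1 + group2)

print(cov_hat)
print(cov_tot)
```

The two printed matrices should match up to rounding. With covariances normalised by $m-1$ and $n-1$, the inputs would first need rescaling by $(m-1)/m$ and $(n-1)/n$ before applying the same formula.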
