Variance – Differences Between Pooled Variance and Combined Variance Explained

Tags: descriptive-statistics, pooling, variance

Could someone explain the difference between combined variance and pooled variance?
I have several groups (more than 2) with different sample sizes, and I want to calculate the overall variance, standard deviation, standard error, and confidence interval. Which method is more appropriate for the overall variance?

I found the formulas below, the first for the combined variance and the second for the pooled variance:

$$S_c^2 = \frac{n_1[S_1^2+(\bar X_1 - \bar X_c)^2] +n_2[S_2^2+(\bar X_2 - \bar X_c)^2] }{n_1 + n_2}\,,$$

where $$\bar X_c=\frac{n_1\bar X_1+n_2\bar X_2}{n_1+n_2}$$

$$ S_p^2 = \frac{(n-1)S_x^2 + (m-1)S_y^2}{n + m -2}$$
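For concreteness, the two formulas can be compared numerically. Below is a minimal Python sketch using made-up data; note that for the combined-variance identity to hold exactly, the group variances $S_1^2, S_2^2$ in that formula must be the divisor-$n_i$ ("population-style") variances, while the pooled formula uses the usual divisor-$(n_i-1)$ sample variances:

```python
import numpy as np

# Hypothetical data for two groups of unequal size
x1 = np.array([4.0, 5.0, 6.0, 7.0])
x2 = np.array([1.0, 2.0, 3.0])

n1, n2 = len(x1), len(x2)
m1, m2 = x1.mean(), x2.mean()

# Combined variance: within-group spread plus spread of group means
# around the grand mean (group variances use divisor n_i, i.e. ddof=0)
xc = (n1 * m1 + n2 * m2) / (n1 + n2)
sc2 = (n1 * (x1.var(ddof=0) + (m1 - xc) ** 2)
       + n2 * (x2.var(ddof=0) + (m2 - xc) ** 2)) / (n1 + n2)

# Pooled variance: degrees-of-freedom-weighted average of the usual
# (n_i - 1)-divisor sample variances
sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
```

With this data, `sc2` coincides with the divisor-$n$ variance of all seven observations lumped together, while `sp2` is considerably smaller because it ignores the gap between the two group means.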

Best Answer

There are two differences between the formula you are calling "combined" variance and the formula you are calling "pooled" variance. One difference is relatively minor, while the other is more substantive. To homogenize notation, let me index the groups as 1 and 2 in the pooled-variance case as well, so $$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}\,.$$

The first difference is that the pooled variance formula contains what is known as a "degrees of freedom" correction: the usual sample variance formula divides by $n-1$ instead of $n$, as doing so produces an unbiased estimate of the population variance. When $n_1$ and $n_2$ are large, this distinction hardly matters, so in discussing the second difference I will ignore it and consider the following version of the pooled variance formula: $$S_p^2 = \frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2}\,.$$

Even after this change, the pooled variance formula is still not the same as the combined variance formula. To understand the distinction, it is useful to think about populations rather than samples. At the population level, a well-known result in probability is the law of total variance, which states that for random variables $X, Y$, $$\mathrm{Var}(X) = \mathbb E[\mathrm{Var}(X\mid Y)] + \mathrm{Var}(\mathbb E[X \mid Y])\,.$$ In the context of pooled/combined variances, we can take $X$ to be the outcome you are measuring and $Y$ to be the group somebody belongs to, so $Y = 1$ or $Y = 2$.
The law of total variance would then read $$\begin{aligned}\mathrm{Var}(X) ={}& \mathrm{Var}(X \mid Y = 1)\, \mathrm{Pr}(Y = 1) + \mathrm{Var}(X \mid Y = 2)\, \mathrm{Pr}(Y = 2)\\ &+ \left(\mathbb E[X \mid Y = 1] - \mathbb E[X]\right)^2\mathrm{Pr}(Y = 1) + \left(\mathbb E[X \mid Y = 2] - \mathbb E[X]\right)^2\mathrm{Pr}(Y = 2)\,,\end{aligned}$$ where the first line is $\mathbb E[\mathrm{Var}(X\mid Y)]$ and the second is $\mathrm{Var}(\mathbb E[X\mid Y])$. Now, with this formula in mind, we can better understand the formula for $S_c^2$. Note the following:

  1. $n_1 / (n_1 + n_2)$ is the sample analogue of $\mathrm{Pr}(Y = 1)$ while $n_2 / (n_1 + n_2)$ is the sample analogue of $\mathrm{Pr}(Y = 2)$.
  2. $S_1^2$ is the sample analogue for $\mathrm{Var}(X | Y = 1)$ while $S_2^2$ is the sample analogue for $\mathrm{Var}(X | Y = 2)$.
  3. $\bar X_1$ is the sample analogue of $\mathbb E[X | Y = 1]$ while $\bar X_2$ is the sample analogue of $\mathbb E[X | Y = 2]$.
  4. $\bar X_c$ is the sample analogue of $\mathbb E[X]$.

From this, you can see that $S_c^2$ simply decomposes the variance of the outcome $X$ into its "within group" and "between group" components, whereas $S_p^2$ includes only the within-group variance. Thus, $S_c^2$ and $S_p^2$ answer subtly different questions. $S_c^2$ asks: "if I drew a random observation from the entire population (i.e. where my draw could come from group 1 or group 2), what is the variance of that draw?" By contrast, $S_p^2$ asks: "if I drew a random observation from the population but also recorded which group it came from, what would the variance of my prediction error be if I used the respective group mean as my prediction?"
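This within/between decomposition can be checked directly. Here is a small sketch with made-up numbers; the group variances use divisor $n_i$ (ddof=0), matching the combined formula:

```python
import numpy as np

# Hypothetical data illustrating the within/between decomposition
g1 = np.array([4.0, 5.0, 6.0, 7.0])
g2 = np.array([1.0, 2.0, 3.0])
n1, n2 = len(g1), len(g2)
n = n1 + n2

xc = np.concatenate([g1, g2]).mean()   # sample analogue of E[X]
w1, w2 = n1 / n, n2 / n                # analogues of Pr(Y=1), Pr(Y=2)

# Within-group component: sample analogue of E[Var(X|Y)]
within = w1 * g1.var(ddof=0) + w2 * g2.var(ddof=0)
# Between-group component: sample analogue of Var(E[X|Y])
between = w1 * (g1.mean() - xc) ** 2 + w2 * (g2.mean() - xc) ** 2

# The combined variance is exactly within + between, and equals the
# divisor-n variance of all observations lumped together
sc2 = within + between
```

Here `within` is the (uncorrected) pooled variance and `between` is the extra piece that $S_c^2$ includes but $S_p^2$ discards.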

One other thing to add: because $S_p^2$ and $S_c^2$ are related via a variance decomposition, and variances are non-negative, we should expect that, to the extent the two numbers differ, $S_c^2$ will be the larger one (the degrees-of-freedom correction can reverse this only when $n_1, n_2$ are very small).
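A quick simulation with hypothetical normal data illustrates this: once the degrees-of-freedom correction is dropped from the pooled estimator, the combined variance is never smaller, since it adds a non-negative between-group term.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical simulated data
violations = 0
for _ in range(1000):
    g1 = rng.normal(0.0, 1.0, size=int(rng.integers(5, 50)))
    g2 = rng.normal(2.0, 1.0, size=int(rng.integers(5, 50)))
    n1, n2 = len(g1), len(g2)
    xc = np.concatenate([g1, g2]).mean()
    # Combined variance: within + between components
    sc2 = (n1 * (g1.var(ddof=0) + (g1.mean() - xc) ** 2)
           + n2 * (g2.var(ddof=0) + (g2.mean() - xc) ** 2)) / (n1 + n2)
    # Pooled variance without the degrees-of-freedom correction
    sp2 = (n1 * g1.var(ddof=0) + n2 * g2.var(ddof=0)) / (n1 + n2)
    if sc2 < sp2 - 1e-12:
        violations += 1

print(violations)  # 0: the combined variance never falls below the pooled one
```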