Pooled Standard Deviation – How to Calculate Pooled Standard Deviation of Means

mathematical-statistics, pooling, standard deviation, standard error, variance

I am reading the book Statistical Methods In Online A/B Testing.
I have two questions:

1.

Consider an A/B test in which the variances of groups A and B are assumed to be the same, and conversions versus users are recorded.

  • For group A, n1 = # observations(users) in A, p1=(# conversions)/(# users) for A.
  • For group B, n2 = # observations(users) in B, p2=(# conversions)/(# users) for B.

For this scenario, the book gives the pooled standard deviation of the means as:

$$\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where
$$p=({p_1*n_1 + p_2*n_2})/(n_1+n_2)$$
Could anyone please explain how this is derived? It seems to be related to the binomial distribution and pooled variance, but I cannot figure out how to derive it.

2.

For differences in means other than proportions, the book gives the pooled standard deviation of the means as
$$\sqrt{(\sigma_1^2*(n_1-1) + \sigma_2^2*(n_2-1))/(n_1+n_2-1)}$$
where $\sigma_1^2$ and $\sigma_2^2$ are the sample variances of A and B respectively, and $n_1$ and $n_2$ are the sample sizes of A and B respectively.

Shouldn't the formula be
$$\sqrt{\sigma^2/n_1 + \sigma^2/n_2}$$
where
$$\sigma=({\sigma_1*n_1 + \sigma_2*n_2})/(n_1+n_2)$$

Explanation for my alternative formula: since groups A and B have the same variance, we can use pooled variance to calculate $\sigma$. The standard errors of the mean for groups A and B are $\sigma/\sqrt{n_1}$ and $\sigma/\sqrt{n_2}$ respectively. Since, by the central limit theorem, each mean follows a normal distribution, the standard deviation of the difference is the square root of the sum of the variances of the means of A and B.
Please correct me if it is wrong.

Best Answer

In the first situation, in two groups $i\in\{1,2\}$ of $n_i$ binary responses you received $K_i$ positive responses and $P_i = K_i/n_i$ is the proportion. (I use capital letters to denote random variables.) Equivalently, $K_i = P_i n_i.$

Under the null hypothesis, each response is independently random and the chance of a positive result is $\pi,$ say. Consequently $K_1 + K_2$ has a Binomial$(n_1+n_2,\pi)$ distribution and you may estimate $\pi$ with the overall fraction of positives

$$\hat\pi = P = \frac{K_1+K_2}{n_1+n_2} = \frac{P_1n_1 + P_2n_2}{n_1+n_2}.$$

Still assuming the null, each $K_i$ independently follows a Binomial$(n_i,\pi)$ distribution and therefore has a variance of $n_i\pi(1-\pi).$ The sample means are $P_i=K_i/n_i.$ The variance of their difference therefore is

$$\begin{aligned} \operatorname{Var}\left(P_2-P_1\right)&= \operatorname{Var}\left(\frac{K_2}{n_2}-\frac{K_1}{n_1}\right)\\&= \frac{1}{n_2^2}\operatorname{Var}(K_2) + \frac{1}{n_1^2}\operatorname{Var}(K_1)\\&= \frac{n_2\pi(1-\pi)}{n_2^2} + \frac{n_1\pi(1-\pi)}{n_1^2}\\&= \pi(1-\pi)\left(\frac{1}{n_2}+\frac{1}{n_1}\right). \end{aligned}$$
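As a sanity check on this algebra, the variance of the difference can be computed exactly for small groups by enumerating every pair $(K_1, K_2)$ and weighting by the binomial probabilities. The following is only a minimal Python sketch; the group sizes and the value of $\pi$ are arbitrary illustrative choices, not values from the question.

```python
from math import comb, isclose

def binom_pmf(k, n, pi):
    """Probability of k successes in n independent trials with success chance pi."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

def exact_var_of_difference(n1, n2, pi):
    """Exact Var(K2/n2 - K1/n1) for independent binomial counts K1 and K2."""
    first_moment = 0.0
    second_moment = 0.0
    for k1 in range(n1 + 1):
        for k2 in range(n2 + 1):
            weight = binom_pmf(k1, n1, pi) * binom_pmf(k2, n2, pi)
            diff = k2 / n2 - k1 / n1
            first_moment += weight * diff
            second_moment += weight * diff * diff
    return second_moment - first_moment**2

# Illustrative values only (assumed, not taken from the question).
n1, n2, pi = 12, 20, 0.3

exact = exact_var_of_difference(n1, n2, pi)
formula = pi * (1 - pi) * (1 / n2 + 1 / n1)

print(exact, formula)
```

The two printed numbers should agree up to floating-point rounding, which is what the last line of the derivation asserts.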

To apply this, you use your estimate $P=\hat\pi$ in place of $\pi.$ Plugging that in and taking the square root gives the pooling formula you quote,

$$\widehat{\operatorname{SD}}(P_2-P_1) = \sqrt{P(1-P)\left(\frac{1}{n_2}+\frac{1}{n_1}\right)}.$$
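As a rough illustration, here is how the pooled estimate might be computed in Python. The conversion counts are made up, and the function name `pooled_proportion_se` is a hypothetical helper, not something from the book.

```python
import math

def pooled_proportion_se(k1, n1, k2, n2):
    """Return (pooled proportion, estimated SD of P2 - P1) under the null."""
    p = (k1 + k2) / (n1 + n2)                      # overall fraction of positives
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return p, se

# Hypothetical data: conversions and users in groups A and B.
k1, n1 = 120, 1000
k2, n2 = 150, 1100

p, se = pooled_proportion_se(k1, n1, k2, n2)

# The same pooled proportion written as in the question,
# as a size-weighted average of the group proportions.
p1, p2 = k1 / n1, k2 / n2
p_weighted = (p1 * n1 + p2 * n2) / (n1 + n2)

# Standardized difference, assuming a normal approximation is reasonable.
z = (p2 - p1) / se

print(p, p_weighted, se, z)
```

Both ways of writing the pooled proportion give the same number because $K_i = P_i n_i.$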


For the second question, because the sample variance is the sum of squared residuals divided by $n_i-1,$ multiplying by $n_i-1$ gives the sum of squared residuals. Under the null hypothesis all squared residuals are exchangeable, so you can add them up and divide by one less than their combined count, $n_1+n_2-1,$ to obtain an estimate of the variance based on all the data. This assumes both standard deviations were computed relative to the overall mean $P.$ When they are computed relative to their separate group means $P_i,$ a different pooling formula is needed: the usual equal-variance estimate divides the same sum by $n_1+n_2-2,$ because one degree of freedom is used up by each group mean.
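The dependence on which mean the residuals are taken about can be checked numerically. The sketch below uses made-up data and a hypothetical helper `var_about`; it only illustrates that the $n_1+n_2-1$ formula reproduces the sample variance of the combined data when residuals are measured from the overall mean, and gives a different number when they are measured from the group means.

```python
from math import isclose
from statistics import mean, variance

# Made-up observations for groups A and B (illustrative only).
a = [9.1, 10.4, 11.2, 8.7, 10.9]
b = [10.2, 9.5, 11.8, 10.1, 9.9, 10.6, 9.3]
n1, n2 = len(a), len(b)

def var_about(xs, center):
    """Sum of squared residuals about `center`, divided by len(xs) - 1."""
    return sum((x - center) ** 2 for x in xs) / (len(xs) - 1)

grand_mean = mean(a + b)

# Group "variances" measured from the overall mean, then pooled with n1 + n2 - 1.
v1 = var_about(a, grand_mean)
v2 = var_about(b, grand_mean)
pooled_overall = (v1 * (n1 - 1) + v2 * (n2 - 1)) / (n1 + n2 - 1)

# Ordinary sample variance of all the data taken together.
combined = variance(a + b)

# Same formula, but with variances measured from each group's own mean.
pooled_group = (variance(a) * (n1 - 1) + variance(b) * (n2 - 1)) / (n1 + n2 - 1)

print(pooled_overall, combined, pooled_group)
```

The first two printed values coincide, while the third is smaller because residuals about each group's own mean are never larger in total than residuals about the overall mean.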


These are all considerations of means and variances and therefore do not rely on the Central Limit Theorem or any unstated distributional assumptions.