If this is a completely balanced within-subject design with $27\times 6=162$ observations, then you can calculate the marginal means directly: simply average over the levels of the second factor. Of course, you have to be sure that averaging over different conditions is meaningful for your planned experiment: do you expect each of those conditions to be present with about 1/3 probability?
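As a minimal sketch (the cell means below are hypothetical; the balance of the design is what justifies the unweighted average):

```python
import numpy as np

# Hypothetical cell-mean table: rows = levels of the factor of interest,
# columns = levels of the second factor we average over.
cell_means = np.array([[4.1, 3.8, 4.4],
                       [5.0, 4.7, 5.3]])

# In a balanced design every cell has the same number of observations,
# so the marginal mean is the unweighted average across columns.
marginal_means = cell_means.mean(axis=1)
print(marginal_means)  # [4.1  5. ]
```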
The real difficulty is with the variance of the difference. It is well known that $$Var(X-Y) = Var(X) + Var(Y) - 2 SD(X)SD(Y)Corr(X,Y)$$
The problem is that you don't know the within-subject correlation.
Option 1. You could simply guess at a value: would you expect the correlation to be high or low? Since higher correlation leads to lower variance of the difference, you could assume the worst-case scenario of zero correlation and be guaranteed to overestimate the required sample size (unless the true correlation is negative, but that is rare).
Option 2. If the published results have more information, like a p-value from a test, you could try to figure out the correlation. For a complicated design like this one, it might be difficult to do analytically, but you could try a simulation approach. Given a correlation coefficient, simulate data with the given means and variances, run the test and check the p-value. Modify the correlation coefficient until you get close to the published result.
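Here is a minimal sketch of that loop, assuming normally distributed within-subject responses and a paired $t$-test; all the numbers below (means, SDs, $n$, and the published p-value) are hypothetical placeholders for the actual published figures:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical published summaries: substitute the real ones.
mean_x, mean_y = 10.0, 9.0    # condition means
sd_x, sd_y = 3.0, 3.0         # condition standard deviations
n = 27                        # number of subjects
published_p = 0.04            # hypothetical reported p-value

def mean_p_value(rho, n_sims=1000):
    """Average paired-t p-value when the within-subject correlation is rho."""
    cov = [[sd_x**2, rho * sd_x * sd_y],
           [rho * sd_x * sd_y, sd_y**2]]
    ps = []
    for _ in range(n_sims):
        x, y = rng.multivariate_normal([mean_x, mean_y], cov, size=n).T
        ps.append(stats.ttest_rel(x, y).pvalue)
    return np.mean(ps)

# Scan candidate correlations; keep the one whose simulated p-value is
# closest to the published result.
grid = np.linspace(0.0, 0.9, 10)
best = min(grid, key=lambda r: abs(mean_p_value(r) - published_p))
print("correlation most consistent with the published p-value:", best)
```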
You seem to be thinking that $\sqrt{\text{Var}(\bar X-\bar Y)} = \sqrt{\text{Var}(\bar X)} + \sqrt{\text{Var}(\bar Y)}$.
This is not the case for independent variables.
For $X,Y$ independent, $\text{Var}(\bar X-\bar Y) = \text{Var}(\bar X) + \text{Var}(\bar Y)$
Further,
$\text{Var}(\bar X) = \text{Var}(\frac{1}{n}\sum_iX_i) = \frac{1}{n^2}\text{Var}(\sum_iX_i)= \frac{1}{n^2}\sum_i\text{Var}(X_i)= \frac{1}{n^2}\cdot n\cdot\sigma^2_1= \sigma^2_1/n$
(if the $X_i$ are independent of each other).
http://en.wikipedia.org/wiki/Variance#Basic_properties
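A quick Monte Carlo check of this identity (with an arbitrary $\sigma$ and $n$, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 2.0
reps = 100_000

# Each row is one sample of n independent draws; take its mean.
xbar = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)

print(xbar.var())    # empirical variance of the sample mean
print(sigma**2 / n)  # theoretical sigma^2 / n = 0.08
```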
In summary, the correct term:
$\color{red}{(1)}$ has $\sigma^2/n$ terms because we're looking at averages and that's the variance of an average of independent random variables;
$\color{red}{(2)}$ has a $+$ because the two samples are independent, so their variances (of the averages) add; and
$\color{red}{(3)}$ has a square root because we want the standard deviation of the distribution of the difference in sample means (the standard error of the difference in means). The quantity under the square root is the variance of the difference (the square of the standard error); taking its square root gives the standard error.
The reason we don't just add standard errors is that standard errors don't add: the standard error of the difference in means is NOT the sum of the standard errors of the sample means for independent samples; the sum will always be too large. The variances do add, though, so we can use them to work out the standard error.
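A small simulation sketch (with arbitrary $\sigma$'s and $n$'s) makes the point concrete: adding variances and then taking the square root matches the empirical standard error of the difference, while adding the standard errors overshoots.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, sigma1, sigma2 = 30, 40, 2.0, 3.0
reps = 100_000

# Sample means from many independent replications of each sample.
xbar = rng.normal(0.0, sigma1, size=(reps, n1)).mean(axis=1)
ybar = rng.normal(0.0, sigma2, size=(reps, n2)).mean(axis=1)

se1, se2 = sigma1 / np.sqrt(n1), sigma2 / np.sqrt(n2)

print((xbar - ybar).std())       # empirical SE of the difference
print(np.sqrt(se1**2 + se2**2))  # correct: add variances, then take the root
print(se1 + se2)                 # too large: standard errors do not add
```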
Here's some intuition about why it's variances that add, rather than standard deviations.
To make things a little simpler, just consider adding random variables.
If $Z = X+Y$, why is $\sigma_Z \leq \sigma_X+\sigma_Y$?
Imagine $Y = kX$ for some $k > 0$; that is, $X$ and $Y$ are perfectly positively linearly dependent. They always 'move together' in the same direction and in proportion.
Then $Z = (k+1)X$, which is simply a rescaling. Since $\sigma_Y = k\sigma_X$, clearly $\sigma_Z = (k+1)\sigma_X = \sigma_X+\sigma_Y$.
That is, when $X$ and $Y$ are perfectly positively linearly dependent, always moving up or down together, standard deviations add.
When they don't always move up or down together, sometimes they move opposite directions. That means that their movements partly 'cancel out', yielding a smaller standard deviation than the direct sum.
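A quick simulation sketch of this intuition (standard normal $X$, with $Y$ constructed to have a chosen correlation): only at perfect positive correlation do the standard deviations add; otherwise the standard deviation of the sum falls short of the direct sum.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1_000_000)
noise = rng.normal(size=x.size)

for rho in [1.0, 0.5, 0.0, -0.5]:
    # Build y with correlation rho to x (both are standard normal).
    y = rho * x + np.sqrt(1 - rho**2) * noise
    # sd of the sum vs. sum of the sds: equal only when rho = 1.
    print(rho, (x + y).std(), x.std() + y.std())
```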
In the first situation, in two groups $i\in\{1,2\}$ of $n_i$ binary responses you received $K_i$ positive responses and $P_i = K_i/n_i$ is the proportion. (I use capital letters to denote random variables.) Equivalently, $K_i = P_i n_i.$
Under the null hypothesis, each response is independently random and the chance of a positive result is $\pi,$ say. Consequently $K_1 + K_2$ has a Binomial$(n_1+n_2,\pi)$ distribution and you may estimate $\pi$ with the overall fraction of positives,
$$\hat\pi = \frac{K_1+K_2}{n_1+n_2}.$$
Still assuming the null, each $K_i$ independently follows a Binomial$(n_i,\pi)$ distribution and therefore has a variance of $n_i\pi(1-\pi).$ The sample means are $P_i=K_i/n_i.$ The variance of their difference therefore is
$$\begin{aligned} \operatorname{Var}\left(P_2-P_1\right)&= \operatorname{Var}\left(\frac{K_2}{n_2}-\frac{K_1}{n_1}\right)\\&= \frac{1}{n_2^2}\operatorname{Var}(K_2) + \frac{1}{n_1^2}\operatorname{Var}(K_1)\\&= \frac{n_2\pi(1-\pi)}{n_2^2} + \frac{n_1\pi(1-\pi)}{n_1^2}\\&= \pi(1-\pi)\left(\frac{1}{n_2}+\frac{1}{n_1}\right). \end{aligned}$$
To apply this, you use your estimate $P=\hat\pi$ in place of $\pi.$ Plugging that in and taking the square root gives the pooling formula you quote,
$$\sqrt{P(1-P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.$$
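As a sanity check, here is a small simulation sketch (with made-up counts $n_1, n_2, K_1, K_2$) comparing this pooled standard error to the empirical spread of $P_2-P_1$ under the null:

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2 = 120, 150
k1, k2 = 54, 81                  # hypothetical observed positives
p_hat = (k1 + k2) / (n1 + n2)    # pooled estimate of pi under the null

# Pooled standard error of P2 - P1.
se = np.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))

# Simulation under the null: both groups share success probability p_hat.
diffs = (rng.binomial(n2, p_hat, 100_000) / n2
         - rng.binomial(n1, p_hat, 100_000) / n1)
print(se, diffs.std())  # the two should agree closely
```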
For the second question: because the sample variance is the sum of squared residuals divided by $n_i-1,$ multiplying by $n_i-1$ recovers the sum of squared residuals. Under the null hypothesis all squared residuals are exchangeable, so you can add them up and divide by one less than their combined count, $n_1+n_2-1,$ to obtain an estimate of the variance based on all the data. This assumes both standard deviations were computed relative to the overall mean $P.$ When they are computed relative to their separate group means $P_i,$ a different pooling formula (one dividing by $n_1+n_2-2$) is needed instead.
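A minimal numeric sketch of that pooling (the binary data are made up; the point is only that summing the squared residuals about the overall mean and dividing by $n_1+n_2-1$ reproduces the variance of the combined sample):

```python
import numpy as np

# Hypothetical binary responses for the two groups.
x1 = np.array([1, 0, 1, 1, 0, 1, 0, 1])
x2 = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])
n1, n2 = len(x1), len(x2)

p = np.concatenate([x1, x2]).mean()  # overall mean P

# Sums of squared residuals about the overall mean (what you recover by
# multiplying a variance computed relative to P by its divisor).
ss1 = ((x1 - p) ** 2).sum()
ss2 = ((x2 - p) ** 2).sum()

pooled_var = (ss1 + ss2) / (n1 + n2 - 1)
print(pooled_var)

# Equivalent: the sample variance of all the data combined.
print(np.concatenate([x1, x2]).var(ddof=1))
```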
These are all considerations of means and variances and therefore do not rely on the Central Limit Theorem or any unstated distributional assumptions.