Pooled Standard Deviation – What is the Pooled Standard Deviation of Paired Samples?

effect-size, standard-deviation

I am trying to do an a priori sample size calculation based on published results. However, the effect size is not reported, and I cannot reconstruct a reasonable estimate of it because I have neither the pooled standard deviation nor the standard deviation of the difference.

The problem resides in the fact that it is a factorial experiment with two within-subjects factors ($2 \times 3$ levels). I only have the cell means and standard deviations (i.e., for the $2 \times 3$ table) but not the marginal means for the first factor (with 2 levels) which I need.

I know that the formula for the pooled standard deviation for independent samples (taken from Wikipedia) is:

$$s_p = \sqrt{\frac{(n_1-1) s_1^2 + (n_2-1) s_2^2 + \cdots + (n_k-1) s_k^2}{n_1 + n_2 + \cdots + n_k - k}}.$$

But what is the formula for pooled standard deviation for dependent samples?


Means:

     1A   1B
2a 3.24 3.01
2b 2.91 2.56
2c 3.01 3.05

Standard deviations:

     1A   1B
2a 0.65 0.70
2b 0.68 0.60
2c 0.46 0.53

I want to obtain the effect size between 1A and 1B (so pool over levels of factor 2). Sample size is 27.

Best Answer

If this is a completely balanced within-subject design with $27\times 6=162$ observations, then you can actually calculate the marginal means: simply average over the levels of the second factor. Of course, you have to be sure that averaging over different conditions is meaningful for your planned experiment - do you expect each of those conditions to be present with about 1/3 probability?
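As a minimal sketch of that averaging step, the marginal means for factor 1 can be computed from the cell means in the question. The variable names are illustrative, and the equal weighting assumes the balanced design described above.

```python
# Cell means from the question, listed per level of factor 1
# in the order 2a, 2b, 2c.
cell_means = {
    "1A": [3.24, 2.91, 3.01],
    "1B": [3.01, 2.56, 3.05],
}

# Balanced design: each level of factor 2 gets equal weight (1/3).
marginal_means = {
    level: sum(values) / len(values) for level, values in cell_means.items()
}

# Raw (unstandardized) difference between 1A and 1B.
mean_diff = marginal_means["1A"] - marginal_means["1B"]

print(marginal_means)
print(round(mean_diff, 4))
```

This gives only the numerator of a standardized effect size; the denominator is the harder part.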

The real difficulty is with the variance of the difference. It is well known that $$\operatorname{Var}(X-Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) - 2\,\operatorname{SD}(X)\,\operatorname{SD}(Y)\,\operatorname{Corr}(X,Y).$$ The problem is that you don't know the within-subject correlation.
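To see how strongly the unknown correlation matters, the formula can be evaluated for a few candidate values. This is only an illustration: the helper name `sd_difference` is made up here, and the two standard deviations are those of a single pair of cells (1A/2a and 1B/2a), not of the marginal means.

```python
from math import sqrt


def sd_difference(sd_x, sd_y, corr):
    """SD of X - Y from the two SDs and their correlation."""
    variance = sd_x**2 + sd_y**2 - 2 * sd_x * sd_y * corr
    # Guard against a tiny negative value from floating-point rounding.
    return sqrt(max(variance, 0.0))


# SDs of cells 1A/2a and 1B/2a from the question; the correlations are guesses.
for corr in (0.0, 0.3, 0.6, 0.9):
    print(corr, round(sd_difference(0.65, 0.70, corr), 3))
```

The standard deviation of the difference shrinks steadily as the assumed correlation grows, which is why the choice of correlation drives the sample size.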

Option 1. You could simply guess at a value: would you expect the correlation to be high or low? Since a higher correlation leads to a lower variance of the difference, you could assume the worst-case scenario of zero correlation and be guaranteed to overestimate the required sample size (unless the true correlation is negative, but that is rare).
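A rough sketch of Option 1 follows. It rests on assumptions that are not in the published results: all six cells share one common within-subject correlation `rho`, and the sample size uses a normal approximation for a two-sided paired test, which gives slightly smaller numbers than an exact t-based calculation. The function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Cell means and SDs from the question, ordered 1A(2a, 2b, 2c), 1B(2a, 2b, 2c).
MEANS = [3.24, 2.91, 3.01, 3.01, 2.56, 3.05]
SDS = [0.65, 0.68, 0.46, 0.70, 0.60, 0.53]
# Contrast: average of the 1A cells minus average of the 1B cells.
WEIGHTS = [1 / 3] * 3 + [-1 / 3] * 3


def sd_of_marginal_difference(rho):
    """SD of the per-subject 1A-minus-1B difference.

    ASSUMPTION: every pair of cells has the same correlation rho.
    """
    variance = 0.0
    for i in range(6):
        for j in range(6):
            corr = 1.0 if i == j else rho
            variance += WEIGHTS[i] * WEIGHTS[j] * corr * SDS[i] * SDS[j]
    return sqrt(variance)


def required_n(rho, alpha=0.05, power=0.80):
    """Approximate number of subjects (normal approximation)."""
    mean_diff = sum(w * m for w, m in zip(WEIGHTS, MEANS))
    d_z = abs(mean_diff) / sd_of_marginal_difference(rho)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil((z / d_z) ** 2)


for rho in (0.0, 0.25, 0.5, 0.75):
    print(rho, required_n(rho))
```

Under these assumptions the required sample size falls as `rho` rises, so planning with `rho = 0` is the conservative choice.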

Option 2. If the published results have more information, like a p-value from a test, you could try to figure out the correlation. For a complicated design like this one, it might be difficult to do analytically, but you could try a simulation approach. Given a correlation coefficient, simulate data with the given means and variances, run the test and check the p-value. Modify the correlation coefficient until you get close to the published result.
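The simulation idea in Option 2 might look like the sketch below, using NumPy and SciPy. Several things here are assumptions, not facts from the paper: normally distributed data, a single common correlation `rho` among all six cells, and a `target_p` that stands in for whatever p-value the paper reports for the main effect of factor 1. The test is a paired t-test on per-subject marginal means, which for a two-level within-subject factor is equivalent to the repeated-measures ANOVA F-test for that main effect.

```python
import numpy as np
from scipy import stats

# Cell means and SDs from the question, ordered 1A(2a, 2b, 2c), 1B(2a, 2b, 2c).
MEANS = np.array([3.24, 2.91, 3.01, 3.01, 2.56, 3.05])
SDS = np.array([0.65, 0.68, 0.46, 0.70, 0.60, 0.53])
N = 27  # subjects in the published study


def median_p_value(rho, n_sims=2000, seed=0):
    """Median paired-t p-value for 1A vs 1B under a common correlation rho."""
    rng = np.random.default_rng(seed)
    corr = np.full((6, 6), rho)
    np.fill_diagonal(corr, 1.0)
    cov = corr * np.outer(SDS, SDS)
    # Shape (n_sims, N, 6): one simulated study per first-axis entry.
    data = rng.multivariate_normal(MEANS, cov, size=(n_sims, N))
    # Per-subject difference between the 1A and 1B marginal means.
    diff = data[..., :3].mean(axis=-1) - data[..., 3:].mean(axis=-1)
    t = diff.mean(axis=1) / (diff.std(axis=1, ddof=1) / np.sqrt(N))
    p = 2 * stats.t.sf(np.abs(t), df=N - 1)
    return float(np.median(p))


def calibrate_rho(target_p, grid=np.linspace(0.0, 0.9, 10)):
    """Grid value of rho whose median p-value is closest (log scale) to target_p."""
    medians = np.array([median_p_value(r) for r in grid])
    best = int(np.argmin(np.abs(np.log(medians) - np.log(target_p))))
    return float(grid[best])


# target_p is a HYPOTHETICAL placeholder; substitute the published value.
print(calibrate_rho(target_p=0.01))
```

A single published p-value is itself a noisy draw, so the calibrated `rho` is a rough guide at best; it is worth checking how much the planned sample size changes across neighbouring values.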
