If you have already done the experiment, there is little point in doing any power analysis. Where the P-values are small, the power for the observed effect size and variability was large enough; where the P-values are large, the power for the observed effect size and variability was small. Power analysis is useful for planning experiments, but not after the fact. See this paper by Hoenig & Heisey: http://www.tandfonline.com/doi/abs/10.1198/000313001300339897#preview
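To see why post hoc power adds nothing beyond the P-value, note that for a two-sided z-test the "observed power" is a direct function of p alone; in particular, a result exactly at p = 0.05 always has observed power of about 50%. An illustrative sketch in Python (not WebPower or pwr code):

```python
from statistics import NormalDist

def observed_power(p, alpha=0.05):
    """Post hoc power of a two-sided z-test, computed from the p-value alone."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    z_obs = z.inv_cdf(1 - p / 2)   # |z| implied by the two-sided p-value
    # power if the true effect equalled the observed effect
    return (1 - z.cdf(z_crit - z_obs)) + z.cdf(-z_crit - z_obs)

print(observed_power(0.05))  # ≈ 0.5
print(observed_power(0.50))  # ≈ 0.10
```

Since the observed power is just a transformation of p, reporting it after the analysis tells you nothing the P-value did not already say.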
Your desire for a power analysis appears to be based on this statement: "one must be sure that the results are 'real' and not just due to the small sample size", so it is worth considering it closely. First, statistical analysis cannot tell you about the reality of a result, something you probably know, given that you put 'real' in quotes. Second, you imply that a small sample is more likely to yield a false positive result, when in fact a small sample is exactly as likely to do so as a large sample; what a small sample is more likely to yield is a false negative.
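The false positive point is easy to check by simulation: when the null is true, a 5%-level t-test rejects about 5% of the time regardless of sample size. An illustrative sketch in Python (the two critical values are hard-coded from the t distribution for df = 8 and df = 98):

```python
import random
from statistics import mean, variance

def t_stat(x, y):
    """Two-sample pooled-variance t statistic."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

def false_positive_rate(n, t_crit, sims=4000, seed=1):
    """Fraction of null-is-true experiments rejected at the 5% level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        x = [rng.gauss(0, 1) for _ in range(n)]  # both groups share a mean
        y = [rng.gauss(0, 1) for _ in range(n)]
        if abs(t_stat(x, y)) > t_crit:
            hits += 1
    return hits / sims

print(false_positive_rate(5, 2.306))   # ≈ 0.05 with n = 5 per group
print(false_positive_rate(50, 1.984))  # ≈ 0.05 with n = 50 per group
```

Both rates hover around the nominal 0.05; sample size changes power (the false negative rate), not the Type I error rate.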
If you want to be confident that the results yield reliable conclusions then you have to consider their nature in light of what is known about the system and, ideally, replicate the parts of the study that are most interesting or surprising. (I acknowledge that a well-judged statistical analysis is more helpful here than a poorly judged one: see Julien Sturnemann's answer for some suggestions.)
I was not able to reproduce the results you got from WebPower using the pilot data you supplied. I was, however, able to reproduce the results from your R code.
You are correct that you can't use $\eta^2$ directly as Cohen's f, but the two are related: $f^2 = \frac{\eta^2}{1-\eta^2}$
"However, how should I compute the effect size from the pilot study" - use the $\eta^2$ from the pilot study.
"Why are there interaction effect sizes, i.e., the effect size for group x vs group y?" Those are the effect sizes for the pair-wise comparisons (the Cohen's d you would use for a t-test or a TukeyHSD)
require(dplyr)
require(reshape2)
pilot <- data.frame(option1 = c(6.3, 2.8, 7.8, 7.9, 4.9),
                    option2 = c(9.9, 4.1, 3.9, 6.3, 6.9),
                    option3 = c(5.1, 2.9, 3.6, 5.7, 4.5),
                    option4 = c(1.0, 2.8, 4.8, 3.9, 1.6))
pilot2 <- pilot %>%
  reshape2::melt(value.name = "y") %>%
  dplyr::rename("option" = "variable")
lm1 <- lm(y ~ option, data = pilot2)
aov1 <- aov(lm1)
means <- apply(pilot, 2, mean)
vs <- apply(pilot, 2, var)
# cohen's f for overall anova
# eta^2 = SS_between / SS_total; the effect (option) row is the first
# row of the anova table, the residuals are the second
eta.sq <- anova(lm1)$`Sum Sq`[1] / sum(anova(lm1)$`Sum Sq`)
f <- sqrt(eta.sq / (1 - eta.sq))
# cohen's d for the pairwise comparisons (pooled SD on n1 + n2 - 2 df)
d <- abs(means[c(1,1,1,2,2,3)] - means[c(2,3,4,3,4,4)]) /
  sqrt(((5-1)*vs[c(1,1,1,2,2,3)] + (5-1)*vs[c(2,3,4,3,4,4)]) / (5+5-2))
names(d) <- c("1-2", "1-3", "1-4", "2-3", "2-4", "3-4")
require(pwr)
# with n = 5 per group, the smallest effect detectable with 80% power
# is f = 0.835, i.e. with only 5 samples we need a very large effect
pwr::pwr.anova.test(k = 4, n = 5, sig.level = 0.05, power = 0.80)
#>
#>      Balanced one-way analysis of variance power calculation
#>
#>               k = 4
#>               n = 5
#>               f = 0.8352722
#>       sig.level = 0.05
#>           power = 0.8
#>
#> NOTE: n is number in each group
# the pilot gives f ≈ 0.805, just below that threshold, so n = 5 per
# group falls slightly short of 80% power; solving for n instead:
pwr::pwr.anova.test(k = 4, f = 0.805, sig.level = 0.05, power = 0.80)
# n comes out just above 5, so plan on 6 per group
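As a cross-check on the effect-size arithmetic, $\eta^2$ and Cohen's f for the overall one-way ANOVA can be recomputed from first principles. An illustrative sketch in Python, using the same pilot numbers:

```python
# pilot data from the question, one list per group
pilot = {
    "option1": [6.3, 2.8, 7.8, 7.9, 4.9],
    "option2": [9.9, 4.1, 3.9, 6.3, 6.9],
    "option3": [5.1, 2.9, 3.6, 5.7, 4.5],
    "option4": [1.0, 2.8, 4.8, 3.9, 1.6],
}

n_total = sum(len(g) for g in pilot.values())
grand = sum(x for g in pilot.values() for x in g) / n_total
# between-group and total sums of squares
ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in pilot.values())
ss_total = sum((x - grand) ** 2 for g in pilot.values() for x in g)

eta_sq = ss_between / ss_total
f = (eta_sq / (1 - eta_sq)) ** 0.5
print(round(eta_sq, 4), round(f, 4))  # 0.3935 0.8055
```

The key point is that $\eta^2$ uses the *between-group* sum of squares in the numerator; accidentally using the residual sum of squares instead inflates f (here it would give about 1.24 rather than 0.81).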
Best Answer
Given my comments under your post above:
It sounds to me like you are analyzing a 2 x 2 contingency table: Group A vs. Group B x Success vs. Failure. With these, you can easily calculate an odds ratio (OR); see metafor::escalc() for good documentation on getting an OR from a 2 x 2 contingency table.
I have used epiR::epi.ccsize() to do power analyses for odds ratios before when working with epidemiologists. It is geared toward epidemiologists, but the statistics are the same, and the code is very simple. Let's say we are expecting an odds ratio of 1.5, where there is a 30% success rate in the control group and a 2:1 ratio of participants in the control versus experimental group (i.e., what you describe in your post), and we want 95% power. epi.ccsize() returns a list; translating from its epidemiologist-centric language, you need 526 experimental and 1052 control participants to get 95% power in that situation.
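Under the hood this is the standard normal-approximation (Fleiss-type) sample-size formula for an unmatched two-group design with unequal allocation. An illustrative sketch in Python (the `cc_size` helper is hypothetical, not the epiR API) reproduces those numbers:

```python
from math import ceil, sqrt
from statistics import NormalDist

def cc_size(OR, p0, power=0.95, r=2, alpha=0.05):
    """Unmatched two-group sample size for a given odds ratio.

    p0: success rate in the control group; r: controls per experimental
    participant. Hypothetical helper implementing the Fleiss-type
    normal-approximation formula, not the epiR interface.
    """
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)
    zb = z.inv_cdf(power)
    # convert the odds ratio into the experimental-group success rate
    odds1 = OR * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)
    pbar = (p1 + r * p0) / (1 + r)  # allocation-weighted average proportion
    num = (za * sqrt((1 + 1 / r) * pbar * (1 - pbar))
           + zb * sqrt(p1 * (1 - p1) + p0 * (1 - p0) / r)) ** 2
    n_exp = ceil(num / (p1 - p0) ** 2)
    return n_exp, r * n_exp

print(cc_size(1.5, 0.30))  # (526, 1052)
```

With OR = 1.5 and a 30% control success rate, the experimental success rate implied by the OR is about 39.1%, and the formula lands on the same 526 experimental / 1052 control figure.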
It might also be tempting to try stats::power.prop.test(), but I'm not sure how to handle your 2:1 ratio using that function. For example, this response says that you just need to make sure your smallest group hits the threshold given by power.prop.test(), but I find that estimate unnecessarily high. This overestimate jibes with user Underminer's comment on the post I linked above.
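To see the size of the overestimate: the equal-allocation requirement works out to about 703 per group, so forcing the smaller (experimental) group up to 703 overshoots the 526 that the unequal-allocation calculation asks for. An illustrative sketch in Python of the normal-approximation formula (the `equal_n_two_prop` helper is hypothetical; it mirrors the formula stats::power.prop.test uses for a two-sided test):

```python
from math import ceil, sqrt
from statistics import NormalDist

def equal_n_two_prop(p1, p2, power=0.95, alpha=0.05):
    """Per-group n for comparing two proportions with equal allocation,
    via the usual two-sided normal approximation."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)
    zb = z.inv_cdf(power)
    pbar = (p1 + p2) / 2  # pooled proportion under equal allocation
    num = (za * sqrt(2 * pbar * (1 - pbar))
           + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# 30% control success rate; an OR of 1.5 implies ~39.1% (9/23) experimental
print(equal_n_two_prop(0.30, 9 / 23))  # 703 per group
```

So 703 experimental participants versus the 526 that the 2:1 design actually needs.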
Here's a relevant RPubs link using the pwr package, discussing unequal sample sizes. However, I find the epiR approach the most intuitive way to do this.