Solved – R – power.prop.test, prop.test, and unequal sample sizes in A/B tests

ab-test, hypothesis-testing, proportion, r, statistical-significance

Say I want to know what sample size I need for an experiment in which I'm seeking to determine whether or not the difference in two proportions of success is statistically significant. Here is my current process:

  1. Look at historical data to establish baseline predictions. Say that in the past, taking an action results in an 11% success rate whereas not taking an action results in a 10% success rate. Assume that these conclusions have not been statistically validated but that they are based on relatively large amounts of data (10,000+ observations).
  2. Plug these assumptions into power.prop.test to get the following:

     power.prop.test(p1=.1,p2=.11,power=.9)
    
     Two-sample comparison of proportions power calculation 
    
              n = 19746.62
             p1 = 0.1
             p2 = 0.11
      sig.level = 0.05
          power = 0.9
    alternative = two.sided
    
  3. So this tells me that I would need a sample size of roughly 20,000 in each group of an A/B test in order to detect a difference of one percentage point at the 5% significance level with 90% power (a programmatic version of these steps is sketched just after this list).

  4. The next step is to perform the experiment with 20,000 observations in each group. Group B (no action taken) has 2,300 successes out of 20,000 observations, whereas Group A (action taken) has 2,100 successes out of 20,000 observations.

  5. Do a prop.test:

    prop.test(c(2300,2100),c(20000,20000))
    
    2-sample test for equality of proportions with continuity correction
    
    data:  c(2300, 2100) out of c(20000, 20000)
    X-squared = 10.1126, df = 1, p-value = 0.001473
    alternative hypothesis: two.sided
    95 percent confidence interval:
    0.003818257 0.016181743
    sample estimates:
    prop 1 prop 2 
    0.115  0.105
    
  6. So we say that we can reject the null hypothesis that the proportions are equal.
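
As a quick reference, here is a minimal R sketch of steps 2 and 5 done programmatically with the same numbers as above (the variable names are just for illustration):

     # Per-group sample size, rounded up to a whole observation
     n_required <- ceiling(power.prop.test(p1 = .1, p2 = .11, power = .9)$n)
     n_required
     # 19747

     # Evaluate the experiment's results
     result <- prop.test(c(2300, 2100), c(20000, 20000))
     result$p.value
     # 0.001473 -- small enough to reject the null at the 5% level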

Questions

  • Is this method sound or at least on the right track?
  • Could I specify alt="greater" on prop.test and trust the p-value even though power.prop.test was for a two-sided test?
  • What if the p-value was greater than .05 on prop.test? Should I assume that I have a statistically significant sample but there is no statistically significant difference between the two proportions? Furthermore, is statistical significance inherent in the p-value in prop.test – i.e. is power.prop.test even necessary?
  • What if I can't do a 50/50 split and need to do, say, a 95/5 split? Is there a method to calculate sample size for this case?
  • What if I have no idea what my baseline prediction should be for proportions? If I guess and the actual proportions are way off, will that invalidate my analysis?

Any other gaps that you could fill in would be much appreciated – my apologies for the convoluted nature of this post. Thank you!

Best Answer

Is this method sound or at least on the right track?

Yes, I think it's a pretty good approach.

Could I specify alt="greater" on prop.test and trust the p-value even though power.prop.test was for a two-sided test?

I'm not certain, but I think you'll need to use alternative="two.sided" for prop.test.
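
For what it's worth, both functions do accept one-sided alternatives, so if you decide on a one-sided test, the consistent approach is to make the power calculation one-sided as well. A minimal sketch (note that power.prop.test spells it "one.sided", while prop.test uses "greater" or "less"):

     # One-sided power calculation (needs a somewhat smaller sample) ...
     power.prop.test(p1 = .1, p2 = .11, power = .9, alternative = "one.sided")

     # ... paired with the corresponding one-sided test
     prop.test(c(2300, 2100), c(20000, 20000), alternative = "greater")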

What if the p-value was greater than .05 on prop.test? Should I assume that I have a statistically significant sample but there is no statistically significant difference between the two proportions? Furthermore, is statistical significance inherent in the p-value in prop.test - i.e. is power.prop.test even necessary?

Yes, if the p-value is greater than .05 then you fail to reject the null hypothesis: you had a large enough sample to detect the effect you planned for, but the data don't show a statistically significant difference between the two proportions (which is not the same as proving they are equal). And yes, statistical significance is assessed through the p-value in prop.test, but power.prop.test is still necessary before you start your experiment to determine your sample size. power.prop.test is used to set up your experiment; prop.test is used to evaluate the results of your experiment.
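
To see why the up-front calculation matters, note that power.prop.test can also solve for power given a sample size, so you can check how underpowered a smaller experiment would have been. For example, with only 2,000 observations per group and the same true rates:

     # Power to detect 10% vs. 11% with only 2,000 observations per group
     power.prop.test(p1 = .1, p2 = .11, n = 2000)$power
     # about 0.18 -- a real difference would usually go undetected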

BTW - You can calculate the confidence interval for each group and see if they overlap at your confidence level. You can do that by following these steps for Calculating Many Confidence Intervals From a t Distribution.
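
In R, the one-sample form of prop.test returns those per-group intervals directly. With the example numbers, the two intervals don't overlap, which is consistent with rejecting the null:

     # 95% confidence interval for each group's success proportion
     prop.test(2300, 20000)$conf.int   # roughly 0.111 to 0.119
     prop.test(2100, 20000)$conf.int   # roughly 0.101 to 0.109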

To visualize what I mean, look at this calculator with your example data plugged in: http://www.evanmiller.org/ab-testing/chi-squared.html#!2300/20000;2100/20000@95

Here is the result:

[image: confidence interval for each group]

Notice the graphic it provides, which shows the range of the confidence interval for each group.

What if I can't do a 50/50 split and need to do, say, a 95/5 split? Is there a method to calculate sample size for this case?

This is exactly where power.prop.test helps, because the split itself doesn't matter. What matters is that you meet the minimum sample size for each group. If you do a 95/5 split, it will just take longer for the variation getting the 5% of traffic to hit its minimum sample size.
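
That said, if you want a sample-size calculation that accounts for the unequal allocation directly, the pwr package can solve for one group's size given the other's. This is just a sketch, and the 100,000 figure for the large group is made up for illustration:

     library(pwr)  # install.packages("pwr") if needed

     # Cohen's effect size h for the two proportions
     h <- ES.h(.11, .1)

     # With 100,000 observations in the 95% group, how many does the
     # 5% group need for 90% power? (Roughly 11,000 here.)
     pwr.2p2n.test(h = h, n1 = 100000, sig.level = .05, power = .9)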

What if I have no idea what my baseline prediction should be for proportions? If I guess and the actual proportions are way off, will that invalidate my analysis?

You'll need to draw a line in the sand, pick a reasonable minimum detectable effect, and calculate the necessary sample size. If you don't have enough time, traffic, etc. to meet the sample size calculated by power.prop.test, then you'll have to settle for a larger detectable effect (smaller effects require larger samples). I usually set it up like this and run through different delta values to see what the sample size would need to be for each effect.

    # Significance level (alpha)
    alpha <- 0.05

    # Statistical power (1 - beta)
    power <- 0.8

    # Baseline conversion rate
    p <- 0.2

    # Minimum detectable effect
    delta <- 0.05

    power.prop.test(p1 = p, p2 = p + delta, sig.level = alpha,
                    power = power, alternative = "two.sided")
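
To "run through different delta values", one convenient pattern is to map over a grid of effects; this is just a sketch of that loop, reusing the variables defined above:

    # Required per-group sample size for a range of detectable effects
    deltas <- seq(0.01, 0.05, by = 0.01)
    sapply(deltas, function(d)
      ceiling(power.prop.test(p1 = p, p2 = p + d,
                              sig.level = alpha, power = power)$n))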