Solved – R – power.prop.test, prop.test, and unequal sample sizes in A/B tests

ab-test, hypothesis-testing, proportion, r, statistical-significance

Say I want to know what sample size I need for an experiment in which I'm seeking to determine whether or not the difference in two proportions of success is statistically significant. Here is my current process:

  1. Look at historical data to establish baseline predictions. Say that in the past, taking an action results in an 11% success rate whereas not taking an action results in a 10% success rate. Assume that these conclusions have not been statistically validated but that they are based on relatively large amounts of data (10,000+ observations).
  2. Plug these assumptions into power.prop.test to get the following:

     power.prop.test(p1=.1,p2=.11,power=.9)
    
     Two-sample comparison of proportions power calculation 
    
              n = 19746.62
             p1 = 0.1
             p2 = 0.11
      sig.level = 0.05
          power = 0.9
    alternative = two.sided
    
  3. So this tells me that I would need a sample size of roughly 20,000 in each group of an A/B test in order to detect a difference of one percentage point at the 5% significance level with 90% power (a programmatic version of these steps is sketched just after this list).

  4. The next step is to perform the experiment with 20,000 observations in each group. Group B (no action taken) has 2,300 successes out of 20,000 observations, whereas Group A (action taken) has 2,100 successes out of 20,000 observations.

  5. Do a prop.test:

    prop.test(c(2300,2100),c(20000,20000))
    
    2-sample test for equality of proportions with continuity correction
    
    data:  c(2300, 2100) out of c(20000, 20000)
    X-squared = 10.1126, df = 1, p-value = 0.001473
    alternative hypothesis: two.sided
    95 percent confidence interval:
    0.003818257 0.016181743
    sample estimates:
    prop 1 prop 2 
    0.115  0.105
    
  6. So we say that we can reject the null hypothesis that the proportions are equal.
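
As a quick reference, here is a minimal R sketch of steps 2 and 5 done programmatically with the same numbers as above (the variable names are just for illustration):

     # Per-group sample size, rounded up to a whole observation
     n_required <- ceiling(power.prop.test(p1 = .1, p2 = .11, power = .9)$n)
     n_required
     # 19747

     # Evaluate the experiment's results
     result <- prop.test(c(2300, 2100), c(20000, 20000))
     result$p.value
     # 0.001473 -- small enough to reject the null at the 5% level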

Questions

  • Is this method sound or at least on the right track?
  • Could I specify alt="greater" on prop.test and trust the p-value even though power.prop.test was for a two-sided test?
  • What if the p-value was greater than .05 on prop.test? Should I assume that I have a statistically significant sample but there is no statistically significant difference between the two proportions? Furthermore, is statistical significance inherent in the p-value in prop.test – i.e. is power.prop.test even necessary?
  • What if I can't do a 50/50 split and need to do, say, a 95/5 split? Is there a method to calculate sample size for this case?
  • What if I have no idea what my baseline prediction should be for proportions? If I guess and the actual proportions are way off, will that invalidate my analysis?

Any other gaps that you could fill in would be much appreciated – my apologies for the convoluted nature of this post. Thank you!

Best Answer

Is this method sound or at least on the right track?

Yes, I think it's a pretty good approach.

Could I specify alt="greater" on prop.test and trust the p-value even though power.prop.test was for a two-sided test?

I'm not certain, but I think you'll need to use alternative="two.sided" for prop.test.
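
For what it's worth, both functions do accept one-sided alternatives, so if you decide on a one-sided test, the consistent approach is to make the power calculation one-sided as well. A minimal sketch (note that power.prop.test spells it "one.sided", while prop.test uses "greater" or "less"):

     # One-sided power calculation (needs a somewhat smaller sample) ...
     power.prop.test(p1 = .1, p2 = .11, power = .9, alternative = "one.sided")

     # ... paired with the corresponding one-sided test
     prop.test(c(2300, 2100), c(20000, 20000), alternative = "greater")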

What if the p-value was greater than .05 on prop.test? Should I assume that I have a statistically significant sample but there is no statistically significant difference between the two proportions? Furthermore, is statistical significance inherent in the p-value in prop.test - i.e. is power.prop.test even necessary?

Yes, if the p-value is greater than .05 then you fail to reject the null hypothesis: you had a large enough sample to detect the effect you planned for, but the data don't show a statistically significant difference between the two proportions (which is not the same as proving they are equal). And yes, statistical significance is assessed through the p-value in prop.test, but power.prop.test is still necessary before you start your experiment to determine your sample size. power.prop.test is used to set up your experiment; prop.test is used to evaluate the results of your experiment.
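
To see why the up-front calculation matters, note that power.prop.test can also solve for power given a sample size, so you can check how underpowered a smaller experiment would have been. For example, with only 2,000 observations per group and the same true rates:

     # Power to detect 10% vs. 11% with only 2,000 observations per group
     power.prop.test(p1 = .1, p2 = .11, n = 2000)$power
     # about 0.18 -- a real difference would usually go undetected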

BTW - You can calculate the confidence interval for each group and see if they overlap at your confidence level. You can do that by following these steps for Calculating Many Confidence Intervals From a t Distribution.
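
In R, the one-sample form of prop.test returns those per-group intervals directly. With the example numbers, the two intervals don't overlap, which is consistent with rejecting the null:

     # 95% confidence interval for each group's success proportion
     prop.test(2300, 20000)$conf.int   # roughly 0.111 to 0.119
     prop.test(2100, 20000)$conf.int   # roughly 0.101 to 0.109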

To visualize what I mean, look at this calculator with your example data plugged in: http://www.evanmiller.org/ab-testing/chi-squared.html#!2300/20000;2100/20000@95

Here is the result:

[image: confidence interval for each group]

Notice the graphic it provides, which shows the range of the confidence interval for each group.

What if I can't do a 50/50 split and need to do, say, a 95/5 split? Is there a method to calculate sample size for this case?

This is exactly where power.prop.test helps, because the split itself doesn't matter. What matters is that you meet the minimum sample size for each group. If you do a 95/5 split, it will just take longer for the variation getting the 5% of traffic to hit its minimum sample size.
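
That said, if you want a sample-size calculation that accounts for the unequal allocation directly, the pwr package can solve for one group's size given the other's. This is just a sketch, and the 100,000 figure for the large group is made up for illustration:

     library(pwr)  # install.packages("pwr") if needed

     # Cohen's effect size h for the two proportions
     h <- ES.h(.11, .1)

     # With 100,000 observations in the 95% group, how many does the
     # 5% group need for 90% power? (Roughly 11,000 here.)
     pwr.2p2n.test(h = h, n1 = 100000, sig.level = .05, power = .9)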

What if I have no idea what my baseline prediction should be for proportions? If I guess and the actual proportions are way off, will that invalidate my analysis?

You'll need to draw a line in the sand, pick a reasonable minimum detectable effect, and calculate the necessary sample size. If you don't have enough time, traffic, etc. to meet the sample size calculated by power.prop.test, then you'll have to settle for a larger detectable effect (smaller effects require larger samples). I usually set it up like this and run through different delta values to see what the sample size would need to be for each effect.

    # Significance level (alpha)
    alpha <- 0.05

    # Statistical power (1 - beta)
    power <- 0.8

    # Baseline conversion rate
    p <- 0.2

    # Minimum detectable effect
    delta <- 0.05

    power.prop.test(p1 = p, p2 = p + delta, sig.level = alpha,
                    power = power, alternative = "two.sided")
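
To "run through different delta values", one convenient pattern is to map over a grid of effects; this is just a sketch of that loop, reusing the variables defined above:

    # Required per-group sample size for a range of detectable effects
    deltas <- seq(0.01, 0.05, by = 0.01)
    sapply(deltas, function(d)
      ceiling(power.prop.test(p1 = p, p2 = p + d,
                              sig.level = alpha, power = power)$n))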