I ran into the same problem myself. Your question came up as I was googling for explanations. (What I'm saying is that I might not be the utmost authority on this.) With that caveat, here's what I found:
The comment above by @MichaelChernick seems to be right; the reason the confidence interval for your test includes zero is that the confidence interval is only approximately a 95% confidence interval. Google "misbehavior of binomial confidence intervals" (or something similar) for more.
I started looking around and found this applet if anyone is interested in just finding the 95% confidence interval for the difference of two proportions without messing around in R. At the bottom of the page containing the applet, they reference a paper by Newcombe that appears to be widely accepted as the best way to calculate CIs for a difference of proportions (specifically, his method #10.)
I looked at the paper and couldn't really do the math. There are a few R packages that claim to implement this method but the best one I found was the Epi package. I'll use your example to explain:
tab <- matrix(c(10,10,22,70), nrow = 2)
tab
     [,1] [,2]
[1,]   10   22
[2,]   10   70
Fisher's Exact test will give you a precise p-value for your table. The problem is that the test statistic it uses is an odds ratio, so it's not good for calculating the difference of proportions or confidence intervals. It is good for p-values though, so use it to check the p-value:
fisher.test(tab)
Fisher's Exact Test for Count Data
data: tab
p-value = 0.03
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.0 9.7
sample estimates:
odds ratio
3.1
So we have an exact p-value of 0.03, which is significant at the 0.05 level. Now look at the twoby2() function from the Epi package:
library(Epi)
twoby2(tab)
2 by 2 table analysis:
------------------------------------------------------
Outcome : Col 1
Comparing : Row 1 vs. Row 2
      Col 1 Col 2    P(Col 1) 95% conf. interval
Row 1    10    22        0.31    0.177     0.49
Row 2    10    70        0.12    0.069     0.22

                                   95% conf. interval
             Relative Risk: 2.50    1.152     5.43
         Sample Odds Ratio: 3.18    1.172     8.64
Conditional MLE Odds Ratio: 3.14    1.028     9.70
    Probability difference: 0.19    0.027     0.37

             Exact P-value: 0.028
        Asymptotic P-value: 0.023
------------------------------------------------------
Note that the results given include the following:
- Individual proportion estimates with 95% CI
- Difference of proportions (0.19) with a 95% CI that does not include zero. (Again, according to the authors this function uses the method described in the Newcombe paper; see the sketch after this list.)
- An exact p-value (which matches the p-value from Fisher's exact test calculated earlier)
- An asymptotic p-value. In my particular problem this p-value matched the one given by the prop.test() function; that is not the case here for some reason.
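For anyone who wants to check that probability-difference interval by hand, here is a small base-R sketch of what I understand Newcombe's hybrid score method (his #10, without continuity correction) to be: compute a Wilson score interval for each proportion separately, then combine the two. The function names are mine; applied to the table above it reproduces the 0.027 to 0.37 interval from twoby2() up to rounding.

# Wilson score interval for a single proportion (no continuity correction)
wilson_ci <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p <- x / n
  centre <- p + z^2 / (2 * n)
  halfwidth <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2))
  c(lower = (centre - halfwidth) / (1 + z^2 / n),
    upper = (centre + halfwidth) / (1 + z^2 / n))
}

# Newcombe-style hybrid score CI for the difference p1 - p2
newcombe_ci <- function(x1, n1, x2, n2, conf.level = 0.95) {
  p1 <- x1 / n1; p2 <- x2 / n2
  ci1 <- wilson_ci(x1, n1, conf.level)
  ci2 <- wilson_ci(x2, n2, conf.level)
  d <- p1 - p2
  c(lower = d - sqrt((p1 - ci1[["lower"]])^2 + (ci2[["upper"]] - p2)^2),
    upper = d + sqrt((ci1[["upper"]] - p1)^2 + (p2 - ci2[["lower"]])^2))
}

newcombe_ci(10, 32, 10, 80)   # roughly 0.027 to 0.37, matching the row above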
Hope this helps anyone else who may wind up here!
If, in the control and treatment groups, the proportions of success are both 0.003, what is the minimal sample size needed for a statistical test of whether the two proportions are equal?
When you are doing hypothesis testing, the null hypothesis, when it is true, will be rejected at a rate equal to the significance level $\alpha$ that you choose; when the null hypothesis is not true, it will ideally be rejected at a rate much higher than the significance level.
What is important is not only the case where "the proportions of success are both 0.003", but also the cases where those proportions are different. The more different the proportions are, the more probable it becomes that you will observe a significant difference and reject the null hypothesis.
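To make that concrete, here is a small sketch using power.prop.test; the sample size of 5000 per group and the grid of alternative rates are arbitrary choices of mine, just to show how the rejection probability grows as the true proportions move apart:

# Power of a two-sample test of proportions at n = 5000 per group,
# with one group fixed at 0.003 and the other increasingly different
p2_values <- c(0.0035, 0.004, 0.005, 0.006)
sapply(p2_values, function(p2) {
  power.prop.test(n = 5000, p1 = 0.003, p2 = p2, sig.level = 0.05)$power
})
# the power increases as p2 moves further away from 0.003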
In order to determine what sample size is necessary, you can express something like the probability of observing a significant difference, given a true difference (of some specific effect size), as a function of the sample sizes. So to compute the sample size you need 1) an idea of a relevant minimal difference/effect and 2) a desired level of power/probability.
It is important to specify this minimal difference, since in practice the null hypothesis is almost never exactly true. One way or another the different treatment might have a minuscule effect (not of the size that was theoretically expected), and given a large enough sample you might show that the two groups differ by that minuscule amount.
When doing hypothesis testing, we often challenge the null hypothesis (there is no effect) in order to show whether there is an effect or not. But what researchers might actually be interested in is to challenge the alternative hypothesis (there is an effect) in order to show whether the hypothesized effect is true or not.
Note: There is a difference between 'not rejecting the null hypothesis' and 'rejecting the alternative hypothesis'.
Two ways to deal with this type of problem are two one-sided tests (TOST) and a likelihood ratio test. In both cases you explicitly specify both hypotheses (null and alternative).
To the point: to do the sample size computations you can approximate the variables as normally distributed. In the simple case you use the 0.003 as an initial value from which you can compute the variance; a more difficult case is when the proportions turn out to be smaller than initially expected (which reduces the number of successes, and what you actually want is a certain number of successes rather than a certain total sample size).
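As a rough sketch of that computation (the target effect of 0.003 vs. 0.006 and the 80% power are assumptions of mine, purely for illustration): the usual normal-approximation formula for the per-group sample size is $n \approx (z_{1-\alpha/2} + z_{1-\beta})^2 \, \big(p_1(1-p_1) + p_2(1-p_2)\big) / (p_1 - p_2)^2$, and power.prop.test() does essentially this calculation (with minor variations in how the variance is handled):

# Smallest difference of interest assumed here: 0.003 vs. 0.006 (a doubling)
res <- power.prop.test(p1 = 0.003, p2 = 0.006,
                       sig.level = 0.05, power = 0.80)
res$n           # required sample size per group (several thousand with rates this small)
res$n * 0.003   # expected number of successes per group at the baseline rate

That last line speaks to the point above: with rates near 0.003 the limiting factor is the handful of successes you expect to see, so it can be safer to plan around a minimum number of successes rather than a minimum total sample size.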
Best Answer
Yes, I think it's a pretty good approach.
I'm not certain, but I think you'll need to use alternative="two.sided" for prop.test. Yes, if the p-value is greater than .05 then there is no confidence that there is a detectable difference between the samples. Yes, statistical significance is inherent in the p-value, but power.prop.test is still necessary before you start your experiment to determine your sample size.
power.prop.test is used to set up your experiment, prop.test is used to evaluate the results of your experiment. BTW, you can calculate the confidence interval for each group and see if they overlap at your confidence level. You can do that by following these steps for Calculating Many Confidence Intervals From a t Distribution.
To visualize what I mean, look at this calculator with your example data plugged in: http://www.evanmiller.org/ab-testing/chi-squared.html#!2300/20000;2100/20000@95
Notice the graphic it provides, which shows the range of the confidence interval for each group.
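If you'd rather get those per-group intervals in R than from the calculator, here is a quick sketch using the same numbers as in the link above (2300/20000 and 2100/20000); prop.test() gives an approximate interval, binom.test() an exact one:

# 95% confidence interval for each group separately
prop.test(2300, 20000)$conf.int    # group 1: 2300 successes out of 20000
prop.test(2100, 20000)$conf.int    # group 2: 2100 successes out of 20000

# and the two-sample test of the difference itself
prop.test(c(2300, 2100), c(20000, 20000))

One caution about the overlap check: non-overlapping 95% intervals do imply a significant difference, but overlapping ones do not necessarily imply the absence of one.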
This is why you need to use power.prop.test: the split itself doesn't matter; what matters is that you meet the minimum sample size for each group. If you do a 95/5 split, then it will just take longer to hit the minimum sample size for the variation that is getting the 5%.
You'll need to draw a line in the sand, guess a reasonable detectable effect, and calculate the necessary sample size. If you don't have enough time, resources, etc. to meet the sample size calculated by power.prop.test, then you'll have to lower your detectable effect. I usually set it up like this and run through different delta values to see what the sample size would need to be for that effect.
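Here is a sketch of what that setup might look like (the baseline rate and the grid of delta values are placeholders I chose, with the baseline taken from the 2300/20000 example above):

# Required sample size per group for a range of candidate detectable effects
baseline <- 2300 / 20000                   # placeholder baseline conversion rate
deltas   <- c(0.005, 0.010, 0.015, 0.020)  # placeholder effect sizes to consider

sapply(deltas, function(delta) {
  power.prop.test(p1 = baseline, p2 = baseline + delta,
                  sig.level = 0.05, power = 0.80,
                  alternative = "two.sided")$n
})
# each entry is the minimum sample size *per group* needed to detect that delta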