I presume they result from two somewhat different approximations in this instance.
For the ordinary chi-square test, the interval that corresponds to the chi-square is the
Wilson score interval
$$\frac{1}{1 + \frac{1}{n} z_{1 - \frac{1}{2}\alpha}^2} \left[ \hat p + \frac{1}{2n} z_{1 - \frac{1}{2}\alpha}^2 \pm z_{1 - \frac{1}{2}\alpha} \sqrt{ \frac{1}{n}\hat p \left(1 - \hat p\right) + \frac{1}{4n^2}z_{1 - \frac{1}{2}\alpha}^2 } \right]$$
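To make the correspondence concrete, here is a minimal R sketch that computes the Wilson interval directly from this formula and checks it against prop.test with the continuity correction switched off; the counts 2300 out of 20000 are borrowed from the calculator example later in the thread:

x <- 2300; n <- 20000; alpha <- 0.05
phat <- x / n
z <- qnorm(1 - alpha / 2)
centre <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
halfwidth <- (z / (1 + z^2 / n)) * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2))
c(centre - halfwidth, centre + halfwidth)
prop.test(x, n, correct = FALSE)$conf.int  # reproduces the same interval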
Looking into the code (just type prop.test at the R prompt to see it), it looks like you get the Wilson score interval by default, but with a continuity correction applied to $p$.
[Note that one of the references in the help (?prop.test) discusses eleven different confidence intervals for the difference in proportions; at most one will always exactly correspond to any given form of the hypothesis test.]
While the without-continuity-correction Wilson score interval will correspond to the without-continuity-correction chi-square, my guess is that the continuity-corrected versions of both that are being used here no longer correspond exactly.
I guess the way to get an interval that does correspond would be to write out the continuity-corrected chi-square statistic in the same fashion used to derive the Wilson score interval (see the Wikipedia link above) and solve it for the endpoints.
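If you'd rather skip the algebra, you can also invert the corrected test numerically: search for the null proportions at which the continuity-corrected p-value from prop.test equals alpha. A sketch for the one-sample case (the counts and the brackets passed to uniroot are illustrative):

x <- 2300; n <- 20000; alpha <- 0.05
# continuity-corrected p-value at null proportion p0, shifted by alpha
f <- function(p0) prop.test(x, n, p = p0, correct = TRUE)$p.value - alpha
lower <- uniroot(f, c(1e-6, x / n))$root      # root below the point estimate
upper <- uniroot(f, c(x / n, 1 - 1e-6))$root  # root above the point estimate
c(lower, upper)  # endpoints that correspond exactly to the corrected test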
Is this method sound or at least on the right track?
Yes, I think it's a pretty good approach.
Could I specify alt="greater" on prop.test and trust the p-value even though power.prop.test was for a two-sided test?
I'm not certain, but I think you'll need to use alternative="two.sided" for prop.test.
What if the p-value was greater than .05 on prop.test? Should I assume that I have a statistically significant sample but there is no statistically significant difference between the two proportions? Furthermore, is statistical significance inherent in the p-value in prop.test - i.e. is power.prop.test even necessary?
Yes, if the p-value is greater than .05 then you fail to reject the null hypothesis: the data give you no confidence that there is a detectable difference between the two proportions. And yes, statistical significance is inherent in the p-value, but power.prop.test is still necessary before you start your experiment, to determine your sample size. power.prop.test is used to set up your experiment; prop.test is used to evaluate the results of your experiment.
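As a sketch of that two-step workflow (the rates and counts are illustrative, matching the calculator example below):

# Step 1, before the experiment: sample size needed to detect a lift from
# 10.5% to 11.5% with 80% power at the 5% significance level
power.prop.test(p1 = 0.105, p2 = 0.115, sig.level = 0.05, power = 0.8)
# Step 2, after the experiment: test the observed counts
prop.test(x = c(2300, 2100), n = c(20000, 20000))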
BTW - you can calculate the confidence interval for each group and see whether they overlap at your confidence level, by following these steps for Calculating Many Confidence Intervals From a t Distribution.
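For instance, using prop.test's score-based intervals on the example counts (a quicker route than the t-based recipe in that link):

prop.test(2300, 20000)$conf.int  # group A, 95% by default
prop.test(2100, 20000)$conf.int  # group B

Bear in mind that overlapping intervals don't by themselves prove the difference is non-significant; the two-sample test is the more direct check.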
To visualize what I mean, look at this calculator with your example data plugged in:
http://www.evanmiller.org/ab-testing/chi-squared.html#!2300/20000;2100/20000@95
The result includes a graphic showing the range of the confidence interval for each group.
What if I can't do a 50/50 split and need to do, say, a 95/5 split? Is there a method to calculate sample size for this case?
This is why you need to use power.prop.test: the split doesn't matter. What matters is that you meet the minimum sample size for each group. If you do a 95/5 split, it'll just take longer to hit the minimum sample size for the variation that is getting the 5%.
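A small sketch of that arithmetic; daily_traffic is a made-up figure purely for illustration:

# power.prop.test reports the required n per group, regardless of the split
n_per_group <- ceiling(power.prop.test(p1 = 0.2, p2 = 0.25,
                                       sig.level = 0.05, power = 0.8)$n)
daily_traffic <- 1000  # hypothetical visitors per day
days_50_50 <- n_per_group / (daily_traffic * 0.50)
days_95_05 <- n_per_group / (daily_traffic * 0.05)  # the 5% arm is the bottleneck
c(n_per_group, days_50_50, days_95_05)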
What if I have no idea what my baseline prediction should be for proportions? If I guess and the actual proportions are way off, will that invalidate my analysis?
You'll need to draw a line in the sand, guess a reasonable detectable effect, and calculate the necessary sample size. If you don't have enough time, resources, etc. to meet the sample size that power.prop.test calculates, then you'll have to lower your detectable effect. I usually set it up like this and run through different delta values to see what the sample size would need to be for each effect (a sketch of such a sweep follows the code below):
# Significance level (alpha)
alpha <- 0.05
# Statistical power (1 - beta); note this variable holds the power, not beta
pwr <- 0.8
# Baseline conversion rate
p <- 0.2
# Minimum detectable effect
delta <- 0.05
power.prop.test(p1 = p, p2 = p + delta, sig.level = alpha, power = pwr,
                alternative = "two.sided")
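A hedged sketch of the delta sweep mentioned above, reusing alpha, pwr, and p from that block:

# Required per-group sample size across a range of candidate effects
for (d in c(0.01, 0.02, 0.05, 0.10)) {
  n <- power.prop.test(p1 = p, p2 = p + d, sig.level = alpha, power = pwr)$n
  cat(sprintf("delta = %.2f -> n per group = %.0f\n", d, ceiling(n)))
}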
Best Answer
As with all statistical tests, you compare the P-value with your significance level. If you use 0.05 as your significance level, you would reject the null hypothesis of equal success probabilities, since 0.03651 < 0.05. If, however, you use 0.01 as your significance level (as implied by your conf.level=0.99), you would not reject the null hypothesis, since 0.03651 ≰ 0.01. If you're not sure what a P-value is, or how to interpret it, Wikipedia has a not-too-bad explanation, with a number of illustrative examples.
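In code, that decision rule is just a comparison (0.03651 is the P-value quoted above):

pval <- 0.03651
pval < 0.05  # TRUE: reject the null at the 5% level
pval < 0.01  # FALSE: fail to reject at the 1% level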