I presume they result from two somewhat different approximations in this instance.
For the ordinary chi-square test, the interval that corresponds to the chi-square is the
Wilson score interval
$$\frac{1}{1 + \frac{1}{n} z_{1 - \frac{1}{2}\alpha}^2} \left[ \hat p + \frac{1}{2n} z_{1 - \frac{1}{2}\alpha}^2 \pm z_{1 - \frac{1}{2}\alpha} \sqrt{ \frac{1}{n}\hat p \left(1 - \hat p\right) + \frac{1}{4n^2}z_{1 - \frac{1}{2}\alpha}^2 } \right]$$
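To make the correspondence concrete, here is a minimal R sketch that computes the Wilson interval directly from this formula and checks it against prop.test with the continuity correction switched off; the counts 2300 out of 20000 are borrowed from the calculator example later in the thread:

x <- 2300; n <- 20000; alpha <- 0.05
phat <- x / n
z <- qnorm(1 - alpha / 2)
centre <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
halfwidth <- (z / (1 + z^2 / n)) * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2))
c(centre - halfwidth, centre + halfwidth)
prop.test(x, n, correct = FALSE)$conf.int  # reproduces the same interval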
Looking into the code (just type prop.test at the R prompt to see it), it looks like you get the Wilson score interval by default, but with a continuity correction applied to $p$.
[Note that one of the references in the help (?prop.test) discusses eleven different confidence intervals for the difference in proportions; at most one will always exactly correspond to any given form of the hypothesis test.]
While the without-continuity-correction Wilson score interval will correspond to the without-continuity-correction chi-square, my guess is that the continuity-corrected versions of both that are being used here no longer correspond exactly.
I guess the way to get an interval that does correspond would be to write out the continuity-corrected chi-square statistic in the same fashion used to derive the Wilson score interval (see the Wikipedia link above) and solve it for the endpoints.
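If you'd rather skip the algebra, you can also invert the corrected test numerically: search for the null proportions at which the continuity-corrected p-value from prop.test equals alpha. A sketch for the one-sample case (the counts and the brackets passed to uniroot are illustrative):

x <- 2300; n <- 20000; alpha <- 0.05
# continuity-corrected p-value at null proportion p0, shifted by alpha
f <- function(p0) prop.test(x, n, p = p0, correct = TRUE)$p.value - alpha
lower <- uniroot(f, c(1e-6, x / n))$root      # root below the point estimate
upper <- uniroot(f, c(x / n, 1 - 1e-6))$root  # root above the point estimate
c(lower, upper)  # endpoints that correspond exactly to the corrected test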
Is this method sound or at least on the right track?
Yes, I think it's a pretty good approach.
Could I specify alt="greater" on prop.test and trust the p-value even though power.prop.test was for a two-sided test?
I'm not certain, but I think you'll need to use alternative="two.sided" for prop.test.
What if the p-value was greater than .05 on prop.test? Should I assume that I have a statistically significant sample but there is no statistically significant difference between the two proportions? Furthermore, is statistical significance inherent in the p-value in prop.test - i.e. is power.prop.test even necessary?
Yes, if the p-value is greater than .05 then you fail to reject the null hypothesis: the data give you no confidence that there is a detectable difference between the two proportions. And yes, statistical significance is inherent in the p-value, but power.prop.test is still necessary before you start your experiment, to determine your sample size. power.prop.test is used to set up your experiment; prop.test is used to evaluate the results of your experiment.
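As a sketch of that two-step workflow (the rates and counts are illustrative, matching the calculator example below):

# Step 1, before the experiment: sample size needed to detect a lift from
# 10.5% to 11.5% with 80% power at the 5% significance level
power.prop.test(p1 = 0.105, p2 = 0.115, sig.level = 0.05, power = 0.8)
# Step 2, after the experiment: test the observed counts
prop.test(x = c(2300, 2100), n = c(20000, 20000))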
BTW - you can calculate the confidence interval for each group and see whether they overlap at your confidence level, by following these steps for Calculating Many Confidence Intervals From a t Distribution.
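For instance, using prop.test's score-based intervals on the example counts (a quicker route than the t-based recipe in that link):

prop.test(2300, 20000)$conf.int  # group A, 95% by default
prop.test(2100, 20000)$conf.int  # group B

Bear in mind that overlapping intervals don't by themselves prove the difference is non-significant; the two-sample test is the more direct check.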
To visualize what I mean, look at this calculator with your example data plugged in:
http://www.evanmiller.org/ab-testing/chi-squared.html#!2300/20000;2100/20000@95
The result includes a graphic showing the range of the confidence interval for each group.
What if I can't do a 50/50 split and need to do, say, a 95/5 split? Is there a method to calculate sample size for this case?
This is why you need to use power.prop.test: the split doesn't matter. What matters is that you meet the minimum sample size for each group. If you do a 95/5 split, it'll just take longer to hit the minimum sample size for the variation that is getting the 5%.
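A small sketch of that arithmetic; daily_traffic is a made-up figure purely for illustration:

# power.prop.test reports the required n per group, regardless of the split
n_per_group <- ceiling(power.prop.test(p1 = 0.2, p2 = 0.25,
                                       sig.level = 0.05, power = 0.8)$n)
daily_traffic <- 1000  # hypothetical visitors per day
days_50_50 <- n_per_group / (daily_traffic * 0.50)
days_95_05 <- n_per_group / (daily_traffic * 0.05)  # the 5% arm is the bottleneck
c(n_per_group, days_50_50, days_95_05)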
What if I have no idea what my baseline prediction should be for proportions? If I guess and the actual proportions are way off, will that invalidate my analysis?
You'll need to draw a line in the sand, guess a reasonable detectable effect, and calculate the necessary sample size. If you don't have enough time, resources, etc. to meet the sample size that power.prop.test calculates, then you'll have to lower your detectable effect. I usually set it up like this and run through different delta values to see what the sample size would need to be for each effect (a sketch of such a sweep follows the code below):
# Significance level (alpha)
alpha <- 0.05
# Statistical power (1 - beta); note this variable holds the power, not beta
pwr <- 0.8
# Baseline conversion rate
p <- 0.2
# Minimum detectable effect
delta <- 0.05
power.prop.test(p1 = p, p2 = p + delta, sig.level = alpha, power = pwr,
                alternative = "two.sided")
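A hedged sketch of the delta sweep mentioned above, reusing alpha, pwr, and p from that block:

# Required per-group sample size across a range of candidate effects
for (d in c(0.01, 0.02, 0.05, 0.10)) {
  n <- power.prop.test(p1 = p, p2 = p + d, sig.level = alpha, power = pwr)$n
  cat(sprintf("delta = %.2f -> n per group = %.0f\n", d, ceiling(n)))
}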
Best Answer
As with all statistical tests, you compare the P-value with your significance level. If you use 0.05 as your significance level, you would reject the null hypothesis of equal success probabilities, since 0.03651 < 0.05. If, however, you use 0.01 as your significance level (as implied by your conf.level=0.99), you would not reject the null hypothesis, since 0.03651 ≰ 0.01. If you're not sure what a P-value is, or how to interpret it, Wikipedia has a not-too-bad explanation, with a number of illustrative examples.
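In code, that decision rule is just a comparison (0.03651 is the P-value quoted above):

pval <- 0.03651
pval < 0.05  # TRUE: reject the null at the 5% level
pval < 0.01  # FALSE: fail to reject at the 1% level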