R – Exact Two Sample Proportions Binomial Test and Interpretation of Strange P-Values

binomial distribution, hypothesis testing, proportion, r, statistical significance

I am trying to solve the following question:

Player A won 17 out of 25 games while player B won 8 out of 20 – is
there a significant difference between both ratios?

The thing to do in R that comes to mind is the following:

> prop.test(c(17,8),c(25,20),correct=FALSE)

    2-sample test for equality of proportions without continuity correction

data:  c(17, 8) out of c(25, 20)
X-squared = 3.528, df = 1, p-value = 0.06034
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.002016956  0.562016956
sample estimates:
prop 1 prop 2 
  0.68   0.40

So this test says that the difference is not significant at the 5% significance level (the 95% confidence interval for the difference includes zero).
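For reference, the statistic reported above can be reproduced by hand as a pooled two-proportion z-test, which is what prop.test() computes when correct=FALSE. This is only a sketch; the variable names are illustrative.

```r
# Observed wins and games for players A and B
x <- c(17, 8)
n <- c(25, 20)

p_hat  <- x / n              # sample proportions: 0.68 and 0.40
p_pool <- sum(x) / sum(n)    # pooled proportion under H0: p_A == p_B

# Standard error of the difference in proportions under H0
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

z       <- (p_hat[1] - p_hat[2]) / se
x2      <- z^2                                     # X-squared as printed by prop.test
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)  # two-sided p-value

c(X_squared = x2, p_value = p_value)
```

The p-value comes from a chi-squared (normal) approximation, which is why it is not exact for small samples.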

Because prop.test() only uses an approximation, I want to make things more exact by using an exact binomial test, and I run it both ways around:

> binom.test(x=17,n=25,p=8/20)

    Exact binomial test

data:  17 and 25
number of successes = 17, number of trials = 25, p-value = 0.006693
alternative hypothesis: true probability of success is not equal to 0.4
95 percent confidence interval:
 0.4649993 0.8505046
sample estimates:
probability of success 
                  0.68 

> binom.test(x=8,n=20,p=17/25)

    Exact binomial test

data:  8 and 20
number of successes = 8, number of trials = 20, p-value = 0.01377
alternative hypothesis: true probability of success is not equal to 0.68
95 percent confidence interval:
 0.1911901 0.6394574
sample estimates:
probability of success 
                   0.4

Now this is strange, isn't it? The p-values are totally different each time! In both cases now the results are (highly) significant but the p-values seem to jump around rather haphazardly.

My questions

Why are the p-values that different each time?
How to perform an exact two sample proportions binomial test in R correctly?

Best Answer

If you are looking for an 'exact' test for two binomial proportions, I believe you are looking for Fisher's Exact Test. In R it is applied like so:

> fisher.test(matrix(c(17, 25-17, 8, 20-8), ncol=2))
    Fisher's Exact Test for Count Data
data:  matrix(c(17, 25 - 17, 8, 20 - 8), ncol = 2)
p-value = 0.07671
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.7990888 13.0020065
sample estimates:
odds ratio 
  3.101466

The fisher.test function accepts a matrix of the 'successes' and 'failures' of the two binomial proportions. As you can see, however, the two-sided test is still not significant, sorry to say. Note also that Fisher's Exact Test is typically only applied when a cell count is low (usually taken to mean 5 or less, though some say 10), so your initial use of prop.test is more appropriate here.
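To make the layout of that matrix explicit, here is a small sketch that builds the same 2x2 table with labels. The dimnames are illustrative and are not required by fisher.test.

```r
# Columns are players, rows are outcomes; matrix() fills column by column
games <- matrix(c(17, 25 - 17,    # player A: won, lost
                  8,  20 - 8),    # player B: won, lost
                nrow = 2,
                dimnames = list(outcome = c("won", "lost"),
                                player  = c("A", "B")))

res <- fisher.test(games)
res$p.value    # two-sided p-value
res$estimate   # conditional estimate of the odds ratio
```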

Regarding your binom.test calls, you are misunderstanding what they test. When you run binom.test(x=17,n=25,p=8/20) you are testing whether the observed proportion 17/25 differs significantly from a population in which the probability of success is known to be exactly 8/20. Likewise, binom.test(x=8,n=20,p=17/25) tests 8/20 against a fixed probability of success of 17/25. These are two different one-sample hypotheses, which is why the p-values differ. In neither case are you comparing the two sample proportions with each other.
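One way to see this is that binom.test() never learns how many games the reference proportion was based on. In the sketch below, the 80-out-of-200 record is a made-up variant of player B used purely for illustration.

```r
# binom.test() only sees the value of p, not the sample behind it
b_small <- binom.test(x = 17, n = 25, p = 8 / 20)$p.value
b_large <- binom.test(x = 17, n = 25, p = 80 / 200)$p.value
identical(b_small, b_large)   # same null value 0.4, so same p-value

# A genuine two-sample test uses both sample sizes
f_small <- fisher.test(matrix(c(17, 8, 8, 12), ncol = 2))$p.value
f_large <- fisher.test(matrix(c(17, 8, 80, 120), ncol = 2))$p.value
c(f_small = f_small, f_large = f_large)
```

The one-sample test ignores the uncertainty in the second proportion, which is why it looks far more significant than the two-sample tests.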

Related Solutions

Solved – R – power.prop.test, prop.test, and unequal sample sizes in A/B tests

Is this method sound or at least on the right track?

Yes, I think it's a pretty good approach.

Could I specify alt="greater" on prop.test and trust the p-value even though power.prop.test was for a two-sided test?

I'm not certain, but I think you'll need to use alternative="two.sided" for prop.test.
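If a one-sided analysis is really what is wanted, one option is to keep the sidedness consistent between planning and analysis: power.prop.test() accepts alternative = "one.sided", and prop.test() accepts "greater" or "less". The sketch below uses illustrative planning rates (0.20 versus 0.25) and the counts from the calculator link further down.

```r
# Planning: required n per group for a two-sided and a one-sided design
n_two <- power.prop.test(p1 = 0.20, p2 = 0.25, sig.level = 0.05,
                         power = 0.80, alternative = "two.sided")$n
n_one <- power.prop.test(p1 = 0.20, p2 = 0.25, sig.level = 0.05,
                         power = 0.80, alternative = "one.sided")$n

# Analysis: use the same sidedness that was planned for
p_two <- prop.test(c(2300, 2100), c(20000, 20000),
                   alternative = "two.sided")$p.value
p_one <- prop.test(c(2300, 2100), c(20000, 20000),
                   alternative = "greater")$p.value

c(n_two = n_two, n_one = n_one, p_two = p_two, p_one = p_one)
```

The direction of a one-sided test should be fixed before looking at the data.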

What if the p-value was greater than .05 on prop.test? Should I assume that I have a statistically significant sample but there is no statistically significant difference between the two proportions? Furthermore, is statistical significance inherent in the p-value in prop.test - i.e. is power.prop.test even necessary?

Yes, if the p-value is greater than .05 then you have not detected a statistically significant difference between the samples at the 5% level. Yes, statistical significance is inherent in the p-value, but power.prop.test is still necessary before you start your experiment to determine your sample size. power.prop.test is used to set up your experiment; prop.test is used to evaluate its results.

BTW - You can calculate the confidence interval for each group and see if they overlap at your confidence level. You can do that by following these steps for Calculating Many Confidence Intervals From a t Distribution.

To visualize what I mean, look at this calculator with your example data plugged in: http://www.evanmiller.org/ab-testing/chi-squared.html#!2300/20000;2100/20000@95

Here is the result:

[Image: confidence interval for each group]

Notice the graphic it provides that shows the range of the confidence interval for each group.
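The same per-group intervals can be approximated in R. This sketch reads the counts from the calculator link above as 2300 and 2100 successes out of 20000 each, and uses exact binomial intervals from binom.test(), which may differ slightly from what the calculator draws.

```r
successes <- c(2300, 2100)
trials    <- c(20000, 20000)

# One 95% confidence interval per group (rows = groups)
ci <- t(sapply(seq_along(successes), function(i) {
  binom.test(successes[i], trials[i], conf.level = 0.95)$conf.int
}))
colnames(ci) <- c("lower", "upper")

cbind(estimate = successes / trials, ci)
```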

What if I can't do a 50/50 split and need to do, say, a 95/5 split? Is there a method to calculate sample size for this case?

This is why you need power.prop.test: the split itself doesn't matter. What matters is that each group reaches the minimum sample size. If you do a 95/5 split, it will just take longer for the variation receiving 5% of the traffic to reach that minimum.
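As a rough sketch of that reasoning, the total traffic needed under a 95/5 split can be estimated from the per-group sample size. The rates 0.20 and 0.25 are illustrative, and since power.prop.test() assumes equal group sizes, this figure is an approximation rather than an exact unequal-allocation calculation.

```r
# Per-group sample size for an equal-allocation design
n_per_group <- ceiling(power.prop.test(p1 = 0.20, p2 = 0.25,
                                       sig.level = 0.05, power = 0.80)$n)

# Share of traffic sent to the smaller variation
small_share <- 0.05

# Total visitors needed before the 5% arm reaches n_per_group
total_needed <- ceiling(n_per_group / small_share)

c(n_per_group = n_per_group, total_needed = total_needed)
```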

What if I have no idea what my baseline prediction should be for proportions? If I guess and the actual proportions are way off, will that invalidate my analysis?

You'll need to draw a line in the sand, guess a reasonable detectable effect, and calculate the necessary sample size. If you don't have enough time, resources, etc. to meet the sample size calculated by power.prop.test, then you'll have to accept a larger minimum detectable effect, because smaller effects need more samples to detect. I usually set it up like this and run through different delta values to see what the sample size would need to be for each effect.

# Significance level (alpha)
alpha <- 0.05

# Statistical power (1 - beta)
power <- 0.8

# Baseline conversion rate
p <- 0.2

# Minimum detectable effect
delta <- 0.05

# Required sample size per group
power.prop.test(p1 = p, p2 = p + delta, sig.level = alpha, power = power, alternative = "two.sided")
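Running through several delta values can be sketched as a small loop. The candidate effect sizes below are arbitrary examples, and the block defines its own variables so it can be run on its own.

```r
alpha  <- 0.05   # significance level
power  <- 0.80   # desired power
p_base <- 0.20   # baseline conversion rate
deltas <- c(0.01, 0.02, 0.05, 0.10)   # candidate minimum detectable effects

# Required sample size per group for each candidate effect
n_needed <- sapply(deltas, function(d) {
  ceiling(power.prop.test(p1 = p_base, p2 = p_base + d,
                          sig.level = alpha, power = power)$n)
})

data.frame(delta = deltas, n_per_group = n_needed)
```

Smaller effects require many more observations per group, which is the trade-off to weigh against available time and traffic.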

Solved – two-sample t-test VS two one-sample t-tests. What’s the difference

The two-sample t-test is appropriate here, because you want to compare the two groups directly.

Two groups can differ significantly, and yet the CIs can still overlap. However, if the CIs do not overlap, then the groups must differ significantly. (This is of course assuming that the significance test and the CIs are calculated using the same assumptions about the data.) This is commonly misunderstood. Reference: http://blog.minitab.com/blog/real-world-quality-improvement/common-statistical-mistakes-you-should-avoid

How can the means of two groups differ significantly and yet have overlapping CIs? Loosely speaking, I think of it this way. There is 95% likelihood that the true mean for each group lies within the CI for that group. But in order for them to have the same mean, one group mean would lie at the extreme of its CI, and the other group mean would lie at the opposite extreme of its CI. That is an unlikely scenario.
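A small numeric sketch of this point, using made-up group means and standard errors and a normal approximation: the two 95% CIs overlap, yet the test on the difference is significant at the 5% level.

```r
m  <- c(0, 3)   # group means (hypothetical)
se <- c(1, 1)   # standard errors of the means (hypothetical)

# Separate 95% confidence intervals for each group
zc    <- qnorm(0.975)
lower <- m - zc * se
upper <- m + zc * se
overlap <- upper[1] > lower[2]   # does group 1's CI reach into group 2's?

# Test of the difference uses the standard error of the difference
z_diff  <- (m[2] - m[1]) / sqrt(sum(se^2))
p_value <- 2 * pnorm(abs(z_diff), lower.tail = FALSE)

list(overlap = overlap, p_value = p_value)
```

The standard error of the difference is sqrt(se1^2 + se2^2), which is smaller than se1 + se2, so the difference can be significant even when the individual intervals overlap.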
