P-Value and Confidence Interval Disagreement in Two Sample Test of Proportions – Analysis

chi-squared-testconfidence intervalp-valueproportion;r

I'm using R to calculate the two-sample test for equality of proportions, where the two proportions are 350/400 and 25/25. So:

> prop.test(c(350,25),c(400,25))                                                                                                                                                           

        2-sample test for equality of proportions with continuity correction

data:  c(350, 25) out of c(400, 25) 
X-squared = 2.4399, df = 1, p-value = 0.1183
alternative hypothesis: two.sided 
95 percent confidence interval:
 -0.17865986 -0.07134014 
sample estimates:
prop 1 prop 2 
 0.875  1.000 

Warning message:
In prop.test(c(350, 25), c(400, 25), correct = FALSE) :
  Chi-squared approximation may be incorrect

What I can't reconcile on my own is that the p-value is greater than 0.05, and yet the 95% confidence interval for the difference does not include 0. I thought there was an 'if and only if' relationship between the two (The p-value < alpha iff the (1-alpha) confidence interval of the difference does not include 0).

What am I not seeing? My only guess is there's something fundamental that I'm misunderstanding, or that it has something to do with that warning message about chi-squared approximation.

Best Answer

I presume they result from two somewhat different approximations in this instance.

For the ordinary chi-square test, the interval that corresponds to the chi-square is the Wilson score interval

$$\frac{1}{1 + \frac{1}{n} z_{1 - \frac{1}{2}\alpha}^2} \left[ \hat p + \frac{1}{2n} z_{1 - \frac{1}{2}\alpha}^2 \pm z_{1 - \frac{1}{2}\alpha} \sqrt{ \frac{1}{n}\hat p \left(1 - \hat p\right) + \frac{1}{4n^2}z_{1 - \frac{1}{2}\alpha}^2 } \right]$$

Looking into the code (just type prop.test to see the code for it), it looks like you get the Wilson score interval by default, but with a continuity correction applied to $p$.

[Note that one of the references in the help (?prop.test) discusses eleven different confidence intervals for the difference in proportions; at most one will always exactly correspond to any given form of the hypothesis test.]

While the without-continuity-correction Wilson score interval will correspond to the without-continuity-correction chi-square, my guess is that the continuity-corrected version of both that is being used no longer correspond exactly.

I guess the way to get an interval that should correspond would be to write the interval corresponding to the continuity-corrected chi-squared in similar fashion to the way the Wilson score interval is derived (see the above Wikipedia link) and solve that for the endpoints.

Related Solutions

Solved – Difference estimate & confidence intervals for $\chi^2$ test between 2 proportions

Just for statistically intimidated people (wait, it should be 'intimidated by statistics') that can identify themselves with this joke:

A patient asks his surgeon what the odds are of him surviving an impending operation. The doctor replies that the odds are usually 50-50. "But there is no need to worry," the doctor explains.

"The first fifty have already died.

It is meant to be a quick reference for those confused with the terminology and overlapping tests - I am not addressing confidence intervals.

COMPARING PROPORTIONS BETWEEN SAMPLES:

I will use the following toy tabulated data:

Antacid <- matrix(c(64, 178 - 64, 92, 190 - 92), nrow = 2)    
Antacid_marginals <- matrix(c(64, 178 - 64, 92, 190 - 92), nrow = 2)
    Antacid_marginals <- rbind(Antacid_marginals, margin.table(Antacid_marginals,2))
    Antacid_marginals <- cbind(Antacid_marginals, margin.table(Antacid_marginals,1))
    dimnames(Antacid_marginals) = list(Symptoms = c("Heartburn", "Normal","Totals"),
                                       Medication = c("Drug A", "Drug B", "Totals"))

This is what it looks like (with marginals):

           Medication
Symptoms    Drug A Drug B Totals
  Heartburn     64     92    156
  Normal       114     98    212
  Totals       178    190    368

So we have 368patients: 178 on Drug A, and 190 on Drug B and we try to see if there are differences in the proportion of heartburn symptoms between drug A and B, i.e. $p1 = 64/178$ vs $p2 = 92/190$.

1. FISHER EXACT TEST: There is a discussion on Wikipedia about "Controversies". Based on the hypergeometric distribution, it is probably most adequate when the expected values in any of the cells of a contingency table are below 5 - 10. The story of the RA Fisher and the tea lady is great, and can be reproduced in [R] by simply grabbing the code here. [R] seems to tolerate without a pause the large numbers in our data (no problem with factorials):

 Antacid <- matrix(c(64, 178 - 64, 92, 190 - 92), nrow = 2)
    fisher.test(Antacid, alternative = "two.sided")
    Fisher's Exact Test for Count Data
data:  Antacid
p-value = 0.02011
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.3850709 0.9277156
sample estimates:
odds ratio 
 0.5988478

2. CHI-SQUARE TEST OF HOMOGENEITY: Otherwise known as the goodness of fit Pearson's chi squared test. For larger samples (> 5 expected frequency count in each cell) the $\chi^2$ provides an approximation of the significance value. The test is based on calculating the expected frequency counts obtained by cross-multiplying the marginals (assuming normal distribution of the marginals, it makes sense that we end up with a $\chi^2$ distributed test statistic, since if $X\sim N(\mu,\sigma^)$, then $X^2\sim \chi^2(1))$:

                    Medication
Symptoms       Drug A                   Drug B               
Heartburn     156 * 178 / 368 = 75      156 * 190 / 368 = 81 
Normal        212 * 178 / 368 = 103     212 * 190 / 368 = 109

The degrees of freedom will be calculated as the {number of populations (Heartburn sufferers and Normals, i.e. 2) minus 1 } * {number of levels in the categorical variable (Drug A and Drug B, i.e. 2) minus 1}. Therefore, in a 2x2 table we are dealing with 1 d.f. And crucially, a $\chi^2$ of $1\,df$ is exactly a squared $N \sim (0,1)$ (proof here), which explains the sentence "a chi-square test for equality of two proportions is exactly the same thing as a z-test. The chi-squared distribution with one degree of freedom is just that of a normal deviate, squared. You're basically just repeating the chi-squared test on a subset of the contingency table" in this post.

The Test Statistic is calculated as:

$\chi^2=\frac{(64-75)^2}{75} + \frac{(92-81)^2}{81} +\frac{(114-103)^2}{103} + \frac{(98-109)^2}{109} = 5.39$

This is calculated in R with the function prop.test() or chisq.test(), which should yield the same result, as indicated here:

prop.test(Antacid, correct = F)

2-sample test for equality of proportions without continuity correction

data:  Antacid
X-squared = 5.8481, df = 1, p-value = 0.01559
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.22976374 -0.02519514
sample estimates:
   prop 1    prop 2 
0.4102564 0.5377358

or..

chisq.test(Antacid, correct = F)

Pearson's Chi-squared test

data:  Antacid
X-squared = 5.8481, df = 1, p-value = 0.01559

3. G-TEST: The Pearson's chi-test statistic is the second order Taylor expansion around 1 of the G test; hence they tend to converge. In R:

library(DescTools) 
GTest(Antacid, correct = 'none')
Log likelihood ratio (G-test) test of
independence without correction

data:  Antacid
G = 5.8703, X-squared df = 1, p-value = 0.0154

4. Z-TEST OF PROPORTIONS: The normal distribution is a good approximation for a binomial when $np>5$ and $n(1-p)>5$. When the occurrences of successes are small in comparison with the total amount of observations, it is the actual number of expected observations that will determine if a normal approximation of a poisson process can be considered ($\lambda \geq 5$).

Although the post hyperlinked is old, I haven't found in CV an R function for it. This may be due to the fact explained above re: $\chi^2_{(df=1)}\sim \, N_{(0,1)}^2$.

The Test Statistic is:

$ \displaystyle Z = \frac{\frac{x_1}{n_1}-\frac{x_2}{n_2}}{\sqrt{p\,(1-p)(1/n_1+1/n_2)}}$ with $\displaystyle p = \frac{x_1\,+\,x_2}{n_1\,+\,n_2}$, where $x_1$ and $x_2$ are the number of "successes" (in our case, sadly, heartburn), over the number of subjects in that each one of the levels of the categorical variable (Drug A and Drug B), i.e. $n_1$ and $n_2$.

In the linked page there is an ad hoc formula. I have been toying with a spin-off with a lot of loose ends. It defaults to a two-tailed alpha value of 0.05, but can be changed, as much as it can be turned into a one tailed t = 1:

zprop = function(x1, x2, n1, n2, alpha=0.05, t = 2){
  nume = (x1/n1) - (x2/n2)
  p = (x1 + x2) / (n1 + n2)
  deno = sqrt(p * (1 - p) * (1/n1 + 1/n2))
  z = nume / deno
  print(c("Z value:",abs(round(z,4))))
  print(c("Cut-off quantile:", 
    abs(round(qnorm(alpha/t),2))))
  print(c("pvalue:", pnorm(-abs(z))))
}

In our case:

    zprop(64, 92 , 178, 190)
    [1] Z value:          2.4183  
    [1] Cut-off quantile: 1.96             
    [1] pvalue:           0.0077

Giving the same z value as the function in the R-Bloggers: z.prop(64, 178 , 92, 190) [1] -5.44273

OK... Hope it helps somebody out there, and I'm sure mistakes will be pointed out...

R Prop.test – Addressing Chi-squared Approximation Errors

The warning is because one of the expected values in the chi-squared is less than 5.

a <- c(6, 15)
b <- c(26, 171)
m <- matrix(c(a, b-a), ncol=2)
chisq.test(m)
chisq.test(m)$expected

However, that rule of thumb is a bit conservative and there are other rules of thumb that you can consider. Some of those other rules of thumb are passed and some are not.

Instead of a chi-squared test, there is also a binomial proportion test.

p1 <- 6/26
n1 <- 26
p2 <- 15/171
n2 <- 171
p <- (n1 * p1 + n2 * p2)/ (n1 + n2)
z <- (p1 - p2) / sqrt(p * (1-p) * (1/n1 + 1/n2))
z

Here we use a normal approximation to the binomial distribution. For this approximiation, there is a rule of thumb that both $np > 5$ and $n(1-p) > 5$ which is true for both proportions. Also, for these two proportions, the normal approximation looks reasonable to me when plotted.

hist(rbinom(10000, 26, 6/26))
hist(rbinom(10000, 171, 15/171))

For this data, the binomial proportion test give a one-sided p-value=0.0139. The one sided prop.test gives a p-value=0.03137.

As @EdM mentions in the comments below, some people feel Fisher's exact test maybe suitable in this situation. This other page nicely gives references to a few people about the appropriateness of Fisher's exact test and it looks like the matter is not yet decided. This test gives a one-sided p-value=0.03963

fisher.test(m, alternative = 'greater')

Best Answer

Related Solutions

Solved – Difference estimate & confidence intervals for $\chi^2$ test between 2 proportions

R Prop.test – Addressing Chi-squared Approximation Errors

Related Question