Solved – Conditional or unconditional exact test in R

chi-squared-test, fishers-exact-test, r

I have a 2×2 contingency table and I want to test whether the proportions in the two columns differ significantly.
I built the following matrix, named raw_matrix:

          CNS random
Not_H3K4  343  28825
H3K4      11   2014

The matrix can be created as follows:

raw_matrix = structure(c(343, 11, 28825, 2014), 
    .Dim = c(2L, 2L), .Dimnames = list(
    c("NotH3K", "H3K"), c("CNS", "Random")))

From what I have read, unconditional exact tests such as Barnard's and Boschloo's are the most powerful tests for this purpose. I installed the 'Exact' package and tried to run the test with this command:

exact.test(raw_matrix)

It ran for more than half an hour on a computer with 64 GB of RAM and a 3.5 GHz CPU, and finally gave the following error:

    Error: cannot allocate vector of size 42.0 Gb
In addition: Warning messages:
1: In matrix(A[xTbls + 1, ] * B[yTbls + 1, ], ncol = length(int)) :
  Reached total allocation of 61417Mb: see help(memory.size)
2: In matrix(A[xTbls + 1, ] * B[yTbls + 1, ], ncol = length(int)) :
  Reached total allocation of 61417Mb: see help(memory.size)
3: In matrix(A[xTbls + 1, ] * B[yTbls + 1, ], ncol = length(int)) :
  Reached total allocation of 61417Mb: see help(memory.size)
4: In matrix(A[xTbls + 1, ] * B[yTbls + 1, ], ncol = length(int)) :
  Reached total allocation of 61417Mb: see help(memory.size)

Then I installed the 'exact2x2' package and ran the test with this command:

exact2x2(raw_matrix)

which gave me the following results:

    Two-sided Fisher's Exact Test (usual method using minimum likelihood)

data:  raw_matrix
p-value = 0.006433
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.2028 4.2424
sample estimates:
odds ratio 
  2.178631

But as I read in the 'Exact' package tutorial, Fisher's exact test, which is a conditional exact test, is not very powerful. Finally, I ran the ordinary chi-squared test with the command chisq.test(raw_matrix), which gave the following results. They differ from the results of Fisher's test:

    Pearson's Chi-squared test with Yates' continuity correction

data:  test_1
X-squared = 6.2045, df = 1, p-value = 0.01274

I'm a geneticist and not an expert in statistics. I would appreciate it if anybody could tell me the best strategy for this test.

Best Answer

What is the nature of your underlying data? It could be the case that the approximation provided by the chi-squared test is reasonable. The basic idea is that if you have enough data, and it is reasonably evenly distributed across the cells in your table, the chi-squared approximation is reasonable (as long as other assumptions, such as random sampling, are met). The general rule of thumb is that at least 80% of the cells should have an expected count of 5 or greater, and no cell should have a count of 0. This is a heuristic, so if you have very unbalanced data you might want to do a bit more research, but if these conditions are satisfied you may just want to proceed with a chi-squared test.
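As a minimal sketch of checking this rule of thumb on the table from the question, the expected counts can be pulled from chisq.test() in base R. The variable names expected, share_ok, and rule_met are only illustrative.

```r
# Rebuild the 2x2 table from the question.
raw_matrix <- matrix(c(343, 11, 28825, 2014), nrow = 2,
                     dimnames = list(c("NotH3K", "H3K"), c("CNS", "Random")))

# Expected counts under independence: row total * column total / grand total.
expected <- chisq.test(raw_matrix)$expected
print(expected)

# Heuristic from the text: at least 80% of cells with an expected count >= 5,
# and no empty cells.
share_ok <- mean(expected >= 5)
rule_met <- share_ok >= 0.8 && all(expected > 0)
print(rule_met)
```

Under these assumptions, rule_met being TRUE only says the heuristic is satisfied; it does not by itself settle which test is preferable.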

If these criteria are not met, Fisher showed that p-values for 2x2 tables can be obtained exactly, because the cell counts, conditional on the table margins, follow a hypergeometric distribution. This can be generalized to larger tables, and I believe at least some R packages estimate this p-value using a Monte Carlo method. An additional issue to consider is that these p-values may be conservative. The "mid-p" value can be used to correct for this, but I am not certain about the theoretical underpinnings of this approach.
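The hypergeometric connection can be sketched directly with dhyper() from base R. The code below sums the probabilities of all tables that are no more likely than the observed one, which should reproduce the two-sided p-value of fisher.test(). The mid-p line uses one simple variant (subtract half the probability of the observed table); that definition is an assumption here, not something taken from the answer.

```r
raw_matrix <- matrix(c(343, 11, 28825, 2014), nrow = 2,
                     dimnames = list(c("NotH3K", "H3K"), c("CNS", "Random")))

x <- raw_matrix[1, 1]        # observed count in the top-left cell
m <- sum(raw_matrix[, 1])    # first column total
n <- sum(raw_matrix[, 2])    # second column total
k <- sum(raw_matrix[1, ])    # first row total

# All values the top-left cell can take with the margins held fixed.
support <- max(0, k - n):min(k, m)
probs   <- dhyper(support, m, n, k)
p_obs   <- dhyper(x, m, n, k)

# Two-sided p-value: total probability of tables at most as likely as the
# observed one (small relative tolerance to guard against rounding).
p_two_sided <- sum(probs[probs <= p_obs * (1 + 1e-7)])

# Assumed mid-p variant: count the observed table with weight one half.
p_mid <- p_two_sided - 0.5 * p_obs

print(c(hypergeometric = p_two_sided,
        fisher.test    = fisher.test(raw_matrix)$p.value,
        mid_p          = p_mid))
```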

Finally, I am not familiar with the exact2x2 package, but if you believe the p-value it produced is reasonable, it doesn't appear that you have an issue with power. Saying a test is not powerful means we are concerned that it will fail to reject the null hypothesis when it is false. Given that the test from the exact2x2 package rejected the null hypothesis at common significance levels, I would think that lack of power is less of a concern.

Related Solutions

Solved – Fisher’s Exact Test and Hypergeometric Distribution

Fisher's exact test works by conditioning on the table margins (in this case, 5 males and 5 females, and 5 soda drinkers and 5 non-drinkers). Under the null hypothesis, the cell probabilities for observing a male soda drinker, male non-soda drinker, female soda drinker, or female non-soda drinker are all equal (0.25) because of the margin totals.

For the particular table you used for the FET, the only other table that is "at least as unlikely" under the null hypothesis is its converse, 5 female non-soda drinkers and 5 male soda drinkers. That is why doubling the probability you obtained from your hypergeometric density gives you the FET p-value.
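A small sketch of that doubling, assuming the table in question is the extreme one in which all 5 members of one sex drink soda and all 5 of the other do not (the exact layout is reconstructed from the description, so treat it as an assumption):

```r
# Assumed table: rows = sex, columns = soda habit, every margin equal to 5.
soda <- matrix(c(5, 0, 0, 5), nrow = 2,
               dimnames = list(c("male", "female"), c("soda", "no_soda")))

# Probability of this exact table given the margins: 1 / choose(10, 5).
p_table <- dhyper(5, m = 5, n = 5, k = 5)

# Fisher's exact test adds the equally unlikely converse table.
p_fet <- fisher.test(soda)$p.value

print(c(p_table = p_table, doubled = 2 * p_table, p_fet = p_fet))
```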

Solved – Can Fisher’s exact test accept a ‘vector of probabilities’

What you did isn't the way you run a chi-squared test. You need a contingency table. For each group, you have some people in STEM fields and some people who aren't. Thus, you will have two rows of counts, or two cells per group. Then you run a chi-squared test of the independence of the rows and columns. Here is a slightly edited version of your data:

totals                      <- c(195, 134, 38)
stems                       <- c(22,16,9)
group_stem_counts           <- matrix(c(stems, totals-stems),ncol=3,byrow=TRUE)
rownames(group_stem_counts) <- c("stem", "non-stem")
colnames(group_stem_counts) <- c("Group One","Group Two","Group Three")
group_stem_counts
#          Group One Group Two Group Three
# stem            22        16           9
# non-stem       173       118          29

Now you can run your test:

chisq.test(group_stem_counts)
# 
#         Pearson's Chi-squared test
# 
# data:  group_stem_counts
# X-squared = 4.5225, df = 2, p-value = 0.1042
# 
# Warning message:
# In chisq.test(group_stem_counts) :
#   Chi-squared approximation may be incorrect

This yields the warning that you saw. As a rule of thumb, it is generally recommended that the expected count for each cell under the null hypothesis be at least 5. However, it has been shown that this is overly conservative, and the chi-squared test is robust even if that isn't exactly the case. We can examine your expected counts like so:

chisq.test(group_stem_counts)$expected
#          Group One Group Two Group Three
# stem      24.97275  17.16076    4.866485
# non-stem 170.02725 116.83924   33.133515
# Warning message:
# In chisq.test(group_stem_counts) :
#   Chi-squared approximation may be incorrect
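For reference, these expected counts are just row total times column total divided by the grand total. A minimal sketch that recomputes them by hand (the name expected is illustrative):

```r
totals <- c(195, 134, 38)
stems  <- c(22, 16, 9)
group_stem_counts <- matrix(c(stems, totals - stems), ncol = 3, byrow = TRUE,
                            dimnames = list(c("stem", "non-stem"),
                                            c("Group One", "Group Two", "Group Three")))

# Expected count for cell (i, j) = row_i total * column_j total / grand total.
expected <- outer(rowSums(group_stem_counts), colSums(group_stem_counts)) /
  sum(group_stem_counts)

print(expected)
print(min(expected))   # the smallest expected count, to compare against 5
```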

Your minimum expected count is 4.866485, and all the others are >5. Realistically, this is nothing to bother over. However, if you are concerned about it, you can just simulate the p-value instead of using the chi-squared approximation. Here is the chi-squared test using that option:

chisq.test(group_stem_counts, simulate.p.value=TRUE)
# 
#         Pearson's Chi-squared test with simulated p-value (based on 2000
#         replicates)
# 
# data:  group_stem_counts
# X-squared = 4.5225, df = NA, p-value = 0.1184

As you can see, the p-value is essentially the same.
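If an exact test is still preferred for a table larger than 2x2, base R's fisher.test() also accepts r x c tables and offers a Monte Carlo option. The sketch below is one way to compare the two; the seed, the number of replicates B, and the enlarged workspace are arbitrary choices made for illustration.

```r
totals <- c(195, 134, 38)
stems  <- c(22, 16, 9)
group_stem_counts <- matrix(c(stems, totals - stems), ncol = 3, byrow = TRUE,
                            dimnames = list(c("stem", "non-stem"),
                                            c("Group One", "Group Two", "Group Three")))

# Exact p-value; workspace is raised only as a precaution for larger tables.
exact_p <- fisher.test(group_stem_counts, workspace = 2e6)$p.value

# Monte Carlo estimate of the same p-value.
set.seed(1)
mc_p <- fisher.test(group_stem_counts, simulate.p.value = TRUE, B = 10000)$p.value

print(c(exact = exact_p, monte_carlo = mc_p))
```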
