Stats for 2×4 contingency table with both very large and small or zero counts

chi-squared-test, contingency-tables, fishers-exact-test, r

I have 7000 2×4 contingency tables of count data. Each table represents a particular position in a genome and records the number of times each DNA nucleotide is observed at that position in 2 different environments. An example contingency table would be

position X  A      C      G      T 
condition1  0      2      20     70000
condition2  3      15     0      95000

or
position Y  A      C     G       T 
condition1  80146  0     5       0
condition2  26821  2     4       0

The data can only be non-negative integers: the minimum count is 0 and the maximum can be >150,000. One cell generally accounts for nearly all of the counts in each row, and it is usually the same cell in both conditions (for example, cell T in the first case above and cell A in the second). One or two other cells then have low counts, and it is in these other cells that the difference, if any, should be observed.

The goal is to identify the positions that differ significantly between these 2 environmental conditions, for further analysis. Our measurement method is estimated to have an error rate of 10^-6.

Problems/doubts I have:

  1. I cannot do a Fisher's test on numbers this large using a 2×4 table. I can run it on 2×2 tables, but that means lots of tests and therefore a big correction for multiple testing. The result also seems to be influenced by the total sum of the row (for example, condition2 may have generally lower total counts), which is something about the Fisher test I don't understand.

  2. I am getting a warning from the chi-square test in R that the "Chi-squared approximation may be incorrect", and I am not sure about this test when there are cells with small or 0 values.

Any suggestions on what test would be good in this case? I am using R to do all the stats.

Thanks in advance,

Ron

Best Answer

It's not the observed values that R is generating the objection to, but the expected values; it's possible to have a mix of high and low observed without triggering that warning.
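To see which cells are behind the warning, the expected counts can be computed directly. The sketch below uses the "position X" table from the question; the object names `tab_x` and `expected` are chosen here for illustration. R's `chisq.test()` issues the warning when any expected count is below 5.

```r
# Counts from "position X" in the question (rows: conditions; columns: A, C, G, T)
tab_x <- matrix(c(0,  2, 20, 70000,
                  3, 15,  0, 95000),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("condition1", "condition2"),
                                c("A", "C", "G", "T")))

# Expected count per cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(tab_x), colSums(tab_x)) / sum(tab_x)
round(expected, 2)

# TRUE marks the cells whose expected count is below 5
expected < 5

# chisq.test() returns the same matrix in its $expected component
res <- suppressWarnings(chisq.test(tab_x))
all.equal(unname(res$expected), unname(expected))
```

For this table only the A column has expected counts below 5 (about 1.3 and 1.7), even though the C and G columns also contain small or zero observed counts.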

Note that one possibility is to simulate the distribution of the chi-square statistic (i.e. fix the margins and randomly generate tables from the set of tables with the same margin).

(R will do that automatically with the argument simulate.p.value=TRUE, though you'll very likely also want to increase the value of B, the number of simulations, from its default of 2000, since the lowest p-value estimate possible is 1/(B+1).)
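A minimal sketch of the simulated test on the "position X" table from the question follows. The value of `B` and the seed are assumptions made for illustration, not recommendations. R computes the simulated p-value as (1 + k)/(B + 1), where k is the number of simulated tables with a statistic at least as large as the observed one, which is why the smallest value it can report is roughly 1/B.

```r
# Counts from "position X" in the question (rows: conditions; columns: A, C, G, T)
tab_x <- matrix(c(0,  2, 20, 70000,
                  3, 15,  0, 95000),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("condition1", "condition2"),
                                c("A", "C", "G", "T")))

set.seed(1)   # only so the illustration is reproducible
B <- 1e4      # assumption: illustrative number of simulated tables

# Monte Carlo p-value: tables are sampled with both margins fixed,
# so no chi-squared approximation (and no warning) is involved
res_sim <- chisq.test(tab_x, simulate.p.value = TRUE, B = B)
res_sim$p.value

# Smallest p-value this run is able to report
1 / (B + 1)
```

If the p-values are later corrected for multiple testing across the 7000 positions, `B` has to be large enough that 1/(B + 1) can fall below the corrected threshold; otherwise no position can reach significance regardless of the data.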

In addition, it appears you have the possibility of some columns being all-zero. Your best bet would be to drop the offending column from the calculation when that happens.
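Dropping all-zero columns can be sketched as below, using the "position Y" table from the question. The helper `drop_empty_columns()` is a hypothetical name introduced here, not a base R function. An all-zero column has expected counts of 0, so its contribution to the statistic is 0/0 and the statistic is undefined until the column is removed.

```r
# Counts from "position Y" in the question; the T column is all zero
tab_y <- matrix(c(80146, 0, 5, 0,
                  26821, 2, 4, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("condition1", "condition2"),
                                c("A", "C", "G", "T")))

# Hypothetical helper: keep only columns with at least one observation.
# drop = FALSE keeps the result a matrix even if a single column remains.
drop_empty_columns <- function(tab) {
  tab[, colSums(tab) > 0, drop = FALSE]
}

tab_y_reduced <- drop_empty_columns(tab_y)
colnames(tab_y_reduced)

set.seed(1)  # only so the illustration is reproducible
p_y <- chisq.test(tab_y_reduced, simulate.p.value = TRUE, B = 1e4)$p.value
p_y
```

The same helper could be applied to each of the 7000 tables before testing, so that positions with an unobserved nucleotide are tested on the remaining columns only.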
