In a sense this is analogous to a situation where you test for differences in group means with ANOVA and then perform a post hoc test, such as Tukey's HSD, to tell which groups are the ones that actually differ. But, there is no equivalent post hoc test for Fisher's test.
The only "post hoc" thing that comes to mind is to run all pairwise comparisons for the table, and correct the p-values accordingly with, e.g., the Bonferroni method.
For a Chi square test, you could check the residuals or simply the expected and observed counts. In addition, going through the percentages of observations in each group would probably answer your question at least partly, and this could be used with either Fisher's or the Chi square test.
In R these can be done as follows:
# Example table (hypothetical counts): rows = gender, columns = groups A-D
tab <- matrix(c(10, 12, 9, 4,
                11, 9, 10, 16),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Male", "Female"), LETTERS[1:4]))
# Percentages for rows and columns
# There is a higher proportion of females than males in group D
prop.table(tab, 1) # rows
prop.table(tab, 2) # columns
# Chi square residuals
# The largest residuals are in the group D
chisq.test(tab)$residuals
# Chi square expected-observed
chisq.test(tab)$expected - chisq.test(tab)$observed
# Chi square "post hoc" test
# For Fisher you need to do this by hand
library(NCstats) # from rforge.net
chisqPostHoc(chisq.test(t(tab))) # for A-D
chisqPostHoc(chisq.test(tab)) # for gender
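For Fisher's test, the "by hand" step above amounts to running every pairwise 2x2 comparison and correcting the p-values, e.g. with Bonferroni. A minimal sketch in Python with scipy (the gender-by-group counts are hypothetical, mirroring the table above):

```python
# Pairwise Fisher exact tests between groups, Bonferroni-corrected.
# Counts per group are hypothetical: (males, females).
from itertools import combinations
from scipy.stats import fisher_exact

tab = {"A": (10, 11), "B": (12, 9), "C": (9, 10), "D": (4, 16)}

pairs = list(combinations(tab, 2))
results = {}
for g1, g2 in pairs:
    # 2x2 subtable: rows are the two groups, columns are the genders
    table = [list(tab[g1]), list(tab[g2])]
    _, p = fisher_exact(table)
    # Bonferroni: multiply by the number of comparisons, cap at 1
    results[(g1, g2)] = min(p * len(pairs), 1.0)

for pair, p in sorted(results.items(), key=lambda kv: kv[1]):
    print(pair, round(p, 4))
```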
There are many ways to combine the information from the individual tests. Some examples follow; where it makes sense, I'd lean toward the options near the top of the list rather than the two at the end:
(a) If in the three situations both test and control are believed to be independent draws from the same populations (a 'test' population with constant proportion and a 'control' population with its own constant proportion, just different-sized samples being drawn in each case), then you can simply combine the data tables and test the combined table. Point and interval estimates based on the combined data reflect the common population values.
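Under the assumptions in (a), combining the tables is just elementwise addition followed by a single test. A sketch in Python with scipy, using three hypothetical 2x2 tables (rows: test/control, columns: success/failure):

```python
# Pool three 2x2 tables by elementwise addition, then run one test.
# The counts are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

tables = [
    np.array([[12, 38], [7, 43]]),
    np.array([[15, 35], [9, 41]]),
    np.array([[11, 39], [6, 44]]),
]

pooled = sum(tables)  # elementwise sum of the three tables
chi2_stat, p, dof, expected = chi2_contingency(pooled)
print(chi2_stat, p, dof)
```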
(b) Even when you don't assume a constant control proportion and a constant test proportion (as in (a) above), under the null the difference in proportions should still be zero in each experiment. You can estimate the difference in proportions for each case, add the estimated differences, and add the variances of the estimates to construct a single statistic. If the difference in proportions were constant, you could get point and interval estimates for it, but the test still works as a test even when you don't assume a constant difference in proportions; it will be sensitive to a tendency of the differences to be in the same direction. It would usually be reasonable to use a normal approximation for this test statistic, but you might also look at simulating its distribution under the null.
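A sketch of that combined difference-in-proportions statistic in Python (the counts are hypothetical, and the unpooled variance estimate is assumed for each difference):

```python
# Sum the per-experiment differences in proportions, sum their
# variances, and form a single z statistic (normal approximation).
# Each tuple is hypothetical: (test successes, n test, control successes, n control).
import math
from scipy.stats import norm

experiments = [(12, 50, 7, 50), (15, 50, 9, 50), (11, 50, 6, 50)]

diff_sum = 0.0
var_sum = 0.0
for x_t, n_t, x_c, n_c in experiments:
    p_t, p_c = x_t / n_t, x_c / n_c
    diff_sum += p_t - p_c
    # variance of the difference of two independent proportions
    var_sum += p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c

z = diff_sum / math.sqrt(var_sum)
p_value = 2 * norm.sf(abs(z))  # two-sided
print(z, p_value)
```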
(c) Again in the case where the test and control proportions are not assumed constant across the three experiments under the alternative, you could still construct a statistic that combines information from the tables in other ways. One example would be to assume that it's not the difference in proportions that's constant under the alternative but the log odds ratio; you could then combine estimates of the log odds ratio (such as by forming a weighted average of them) and use that as an overall test statistic.
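One common choice of weights is inverse-variance weighting of the per-table log odds ratios. A sketch in Python (the tables are hypothetical, and Woolf's variance estimate 1/a + 1/b + 1/c + 1/d is assumed):

```python
# Inverse-variance weighted average of log odds ratios across tables.
# Each hypothetical table is [[test success, test failure],
#                             [control success, control failure]].
import math
from scipy.stats import norm

tables = [
    [[12, 38], [7, 43]],
    [[15, 35], [9, 41]],
    [[11, 39], [6, 44]],
]

num = 0.0  # weighted sum of log odds ratios
den = 0.0  # sum of weights
for (a, b), (c, d) in tables:
    log_or = math.log((a * d) / (b * c))
    var = 1 / a + 1 / b + 1 / c + 1 / d  # Woolf's variance estimate
    w = 1 / var
    num += w * log_or
    den += w

pooled_log_or = num / den
se = math.sqrt(1 / den)
z = pooled_log_or / se
print(pooled_log_or, 2 * norm.sf(abs(z)))
```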
(d) You could combine (by addition) the chi-square statistics for the individual tables; the chi-square approximation should be better in the combined case, though again it should be possible to construct simulated null distributions.
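A sketch of (d) in Python with scipy: add the per-table statistics and degrees of freedom, then use the chi-square upper tail (the tables are hypothetical):

```python
# Sum chi-square statistics across independent tables and compare
# against a chi-square with the summed degrees of freedom.
from scipy.stats import chi2, chi2_contingency

tables = [
    [[12, 38], [7, 43]],
    [[15, 35], [9, 41]],
    [[11, 39], [6, 44]],
]

stat = 0.0
dof = 0
for t in tables:
    s, _, d, _ = chi2_contingency(t, correction=False)
    stat += s
    dof += d

p_combined = chi2.sf(stat, dof)
print(stat, dof, p_combined)
```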
(e) If the tests are independent, you can use Fisher's method for combining p-values, which is effectively to multiply the p-values, as one would tend to with independent probabilities (though by working on the log scale, it's easier to compute the distribution).
If the nulls are true, the p-values have a uniform distribution. The $-2 \ln p_i$ should be exponentially distributed with mean $2$ (i.e. $\chi^2_2$) and adding those will give something that under the null should be $\chi^2_6$. If the combined result is unusually large for a $\chi^2_6$, you'd reject the null that the p-values were drawn from a uniform distribution in favor of the alternative that they tended to be smaller. In this particular case we have the slight problem that - even under the null - the p-values are discrete, so if the numbers are very small you might want to consider simulation under the null here as well.
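A sketch of Fisher's method in Python (the three p-values are hypothetical):

```python
# Fisher's method: -2 * sum(log p_i) is chi-square with 2k degrees of
# freedom under the null, for k independent p-values.
import math
from scipy.stats import chi2

p_values = [0.08, 0.12, 0.20]  # hypothetical

stat = -2 * sum(math.log(p) for p in p_values)
p_combined = chi2.sf(stat, df=2 * len(p_values))
print(stat, p_combined)
```

scipy also ships this directly as `scipy.stats.combine_pvalues(p_values, method='fisher')`, which returns the same statistic and combined p-value.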
(f) you could even add p-values. If the common nulls are true, the p-values (again) should be uniform; the sum of the p-values should have the distribution of a sum of uniforms; again this sum can be tested (in this case you test whether the sum of the p-values is too small to have come from a sum-of-uniforms), though again the discreteness may be an issue in some cases.
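A sketch of the sum-of-p-values test in Python, using the Irwin-Hall distribution for a sum of independent uniforms (the p-values are hypothetical):

```python
# The sum of k independent Uniform(0,1) p-values follows the
# Irwin-Hall distribution; an unusually small sum is evidence
# against the combined null.
import math

def irwin_hall_cdf(s, k):
    """P(U1 + ... + Uk <= s) for independent Uniform(0, 1) variables."""
    total = 0.0
    for j in range(int(math.floor(s)) + 1):
        total += (-1) ** j * math.comb(k, j) * (s - j) ** k
    return total / math.factorial(k)

p_values = [0.08, 0.12, 0.20]  # hypothetical
s = sum(p_values)
p_combined = irwin_hall_cdf(s, len(p_values))  # lower tail: small sums are extreme
print(s, p_combined)
```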
Where it's reasonable (from your prior knowledge of the situation) to make some assumptions (such as constant proportions, constant differences in proportions, constant log odds ratios, or whatever), you should probably do so; this is usually more meaningful than, say, falling back on case (e), even though that is still a perfectly valid thing to do.
You can use a Fisher exact test in your first example, though with such a large sample a Chi square test will give a similar result and will be easier to calculate without specialist software. Just looking at the numbers, it seems obvious that you will reject your null hypothesis:
E1 happens quite frequently in your observations of P2 and P4, but not at all with P1 and P3.

In your second example you have no information at all about P1 and P3, so all you are testing is whether there is a difference between P2 and P4. There is a difference in your observations, but it is obviously not as large as in your first example. The statistic is telling you that the difference is not significant, and so you should not reject your null hypothesis. And this is what you need to be told with this data.