Solved – Proper analyses for 2×2 contingency tables

Tags: chi-squared-test, contingency tables, exact-test, fishers-exact-test, mcnemar-test

So I have multiple questions about the different types of analyses that can be used on a 2×2 contingency table and when different analyses should be used.

To open, I'll note that this stems from my efforts to find the appropriate analyses to run for research using categorical data in a 2×2 contingency table. In my case I have two conditions: control and experimental and within the experimental group there are two items of interest (say, Item A and Item B). In all cases I'm investigating whether an outcome occurs or not: How many experimental participants show the outcome on Item A? How many experimental participants show the outcome on Item B? How many control participants show the outcome? So in addition to comparing Item A with the Control group (A vs. C) and Item B with the Control group (B vs. C), I also want to compare Item A with Item B (A vs. B).

My understanding is that for between-subjects comparisons (e.g., A vs. C and B vs. C) Pearson's $\chi^{2}$ test is traditionally used, and that for within-subjects comparisons (e.g., A vs. B) McNemar's test is a common choice.

Question 1: is this an accurate understanding of the common choices for statistical tests using 2×2 contingency tables?


My follow-up questions, and where most of my confusion lies, are related to what to do when multiple observed cell counts are low (less than 10). I've explored numerous threads here, the Wikipedia pages for different topics (Yates's correction, Fisher's exact, etc.), a few journal articles, and descriptions in R-packages, but I have found conflicting views.

Some sources indicate that you should use Yates's correction; others say it's outdated because, in addition to being overly conservative, it was only preferred because the other options were computationally intensive.

Others say that you should use Fisher's exact test, but I've seen additional information suggesting that Fisher's exact test is only appropriate if you know both the row and column totals and that other options (Boschloo, Barnard, etc.) are generally more powerful. However, which of those to use is also not clear.

Finally, still others suggest that Pearson's $\chi^{2}$ test is appropriate as long as the expected cell counts are not low.

I've seen some indication that these arguments occur for McNemar's test as well (e.g., continuity correction & McNemar's exact test).

Question 2: Is the assumption regarding cell counts about expected cell counts or observed cell counts? If the observed cell counts are low, does that matter to the type of test that should be used?

Question 3: What is an exact test and when should exact tests be used?


In addition to answers to these questions, I would also greatly appreciate any suggestions as to books or articles I can read to learn more about this. Part of my struggle with all of this has been a general difficulty finding clear information about these topics and how they relate to one another.

Best Answer

Pearson's $\chi^2$ test is useful for a sample of $n$ observations cross-classified by two variables, say $A$ and $B$. It tests the null hypothesis that $A$ and $B$ are independent. For example, if you crossed two strains of D. melanogaster (fruit flies) with different mutations and observed the $F_2$ generation frequencies in $n$ progeny, the $\chi^2$ test tests for linkage of the two traits (i.e., are they on different chromosomes [the null] or on the same chromosome [i.e., linked, the alternative]).
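To make the mechanics concrete, here is a minimal sketch of Pearson's test for a 2×2 table using only the Python standard library. The function name `pearson_chi2_2x2` and the counts are made up for illustration; no continuity correction is applied, and the $p$-value uses the fact that a $\chi^2$ variable with 1 degree of freedom is a squared standard normal.

```python
import math

def pearson_chi2_2x2(table):
    """Pearson's chi-squared test of independence for a 2x2 table.

    Hypothetical helper for illustration; no continuity correction.
    Returns (statistic, asymptotic p-value, expected counts).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    # Expected counts under independence: row total * column total / n
    expected = [[r * k / n for k in col_totals] for r in row_totals]
    stat = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # Upper tail of chi-squared with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected

# Made-up counts: rows are the levels of A, columns the levels of B
table = [[30, 20], [20, 30]]
stat, p_value, expected = pearson_chi2_2x2(table)
print(stat, p_value)  # statistic is 4.0, p is about 0.0455
```

Every expected count here is 25, so the asymptotic approximation should be reasonable for this table.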

McNemar's test is used for paired data; that is, each observation represents a pair of values. For example, consider a set of $n$ lung cancer patients, each with a spouse. You record the smoking habits of the patients and their spouses, and cross-classify. Pearson's test would treat this as $2\,n$ observations, but in this case you only have $n$ pairs. McNemar's test makes this correction. The hypotheses tested are similar: "Is cancer status related to smoking status?"
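As a rough sketch of how the paired version works: McNemar's statistic depends only on the two discordant cells (pairs where the members differ), usually written $b$ and $c$. The helper below is hypothetical and the counts are invented. It returns the uncorrected asymptotic $p$-value and also the exact binomial version that the question mentions, which treats $b$ as Binomial($b + c$, 1/2) under the null.

```python
import math

def mcnemar(b, c):
    """McNemar's test from the discordant pair counts b and c.

    Hypothetical helper for illustration.
    Returns (statistic, asymptotic p-value, exact two-sided p-value).
    """
    n = b + c
    if n == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    # Uncorrected statistic, referred to chi-squared with 1 df
    stat = (b - c) ** 2 / n
    p_asymptotic = math.erfc(math.sqrt(stat / 2))
    # Exact version: double the smaller binomial tail, capped at 1
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    return stat, p_asymptotic, p_exact

# Made-up counts: 15 pairs where only the patient smokes,
# 5 pairs where only the spouse smokes
stat, p_asymptotic, p_exact = mcnemar(15, 5)
print(stat, p_asymptotic, p_exact)
```

The concordant pairs (both smoke, or neither smokes) do not enter the calculation at all, which is why the effective sample size can be much smaller than $n$.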

I suppose that one could think of this as a "between subjects" vs "within subjects" difference, and there is no doubt that things are similar. I don't see them that way, but I'll confess to not having thought about it much.

Regarding your Question 2, the restriction is on expected cell counts, not observed cell counts. Observed counts are reality, while expected cell counts represent a model. You can think of the restrictions as helping to ensure a decent approximation under the null hypothesis. Reality can (and should) diverge from the model when necessary, but if the model is approximately correct, it would be bad to have a situation where discrepancies get inflated in small cells.
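A small illustration of the distinction, with invented counts: the table below has two observed cells of 2, yet every expected count under independence is 25. The threshold of 5 in the code is the commonly quoted rule of thumb, not something stated in this answer, and `expected_counts` is a hypothetical helper.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Made-up table with small observed counts in two cells
observed = [[2, 48], [48, 2]]
expected = expected_counts(observed)

min_observed = min(min(row) for row in observed)
min_expected = min(min(row) for row in expected)

# Assumed rule of thumb: all expected counts should be at least 5
print(min_observed, min_expected, min_expected >= 5)
```

Here the small observed cells are evidence against independence rather than a reason to distrust the approximation.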

Finally, an exact test is precisely what it says it is: the distribution of the test statistic under the null hypothesis is known exactly. Pearson's $\chi^2$, McNemar's test, and the log-likelihood $\chi^2$ are all based on asymptotic approximations to the distribution of the test statistic under the null hypothesis. Fisher's test, by comparison, notes that conditionally on the marginal totals, the counts in the two cells of any row (or column) of the table follow a hypergeometric distribution. This insight permits computation of an exact observed significance level ($p$-value) for any given number of observations in the $1, 1$ cell.
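The hypergeometric calculation can be sketched in a few lines. This is an illustrative implementation, not a reference one: the function name is made up, and the two-sided $p$-value uses one common convention (sum the probabilities of all tables with the same margins that are no more likely than the observed one). Other two-sided conventions exist and can give different values.

```python
import math

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact test for a 2x2 table (hypothetical helper).

    Conditions on both margins, so the (1, 1) cell is hypergeometric.
    """
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    denom = math.comb(n, c1)

    def prob(k):
        # P((1, 1) cell = k | margins) under the null hypothesis
        return math.comb(r1, k) * math.comb(r2, c1 - k) / denom

    lo, hi = max(0, c1 - r2), min(r1, c1)  # feasible values of the cell
    p_obs = prob(a)
    # Small relative tolerance so ties are not lost to rounding
    p_value = sum(
        prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-7)
    )
    return min(1.0, p_value)

# Made-up table with all margins equal to 4
p_value = fisher_exact_2x2([[3, 1], [1, 3]])
print(p_value)  # 34/70, about 0.4857
```

With these margins the (1, 1) cell can only take the values 0 through 4, with probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the $p$-value is a finite sum rather than an approximation.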

Fisher's exact test tests the same null as Pearson's $\chi^2$ and can be used whenever Pearson's is appropriate, as well as in situations where Pearson's approximation is believed to be unreliable. Pearson's test also makes use of the information in the marginal totals, and so is also conditional on those totals. Knowing the margins (or even one margin) a priori is unnecessary.