Is the chi-squared test appropriate with many small counts in a 5×2 table?

categorical-data, chi-squared-test, contingency-tables, fishers-exact-test, python

I have two sample populations, A and B, which are independent.

             A    B
Ethnicity_1  1    2
Ethnicity_2  3    0
Ethnicity_4  1    0
Ethnicity_5  3    8
Ethnicity_6  15   12

To determine whether there is a statistically significant difference between the makeup of the two samples (with a null hypothesis that the two samples come from the same population), is the correct test a chi-squared contingency-table test:

scipy.stats.chi2_contingency

Or, given that the counts are small, is Fisher's exact test more appropriate? (It appears that the implementation of Fisher's exact test in scipy cannot handle tables bigger than 2×2.)

Best Answer

The solution depends intimately on how the data were collected and summarized. This answer takes you through a process of thinking about the data, analyzing them, reflecting on the results, and improving the test until some insight is achieved. Along the way we develop and compare several variants of the $\chi^2$ test.


Fisher's test is not applicable because you have two independent samples. Assuming you decided beforehand how large each sample should be, the column counts ("marginals") are indeed fixed, as assumed by that test. But (I presume) you had no predetermined control over the total numbers of each ethnicity that would be observed, so the row counts (their marginals) are not fixed. That is contrary to what Fisher's test assumes.

(Fisher's test would indeed apply if these data had arisen from a single collection of $45$ subjects who were randomly divided by the experimenter into two groups of predetermined sizes $23$ and $22$, as is often done in controlled experiments.)

The Chi-squared Test

In these data the total count is $45$ for $5\times 2=10$ table entries, producing a mean count of $4.5$ spread through two columns of roughly equal totals ($23$ and $22$). This is starting to get into the range where rules of thumb suggest the $\chi^2$ statistic--which is just a number measuring a discrepancy between the two ethnicity distributions--may have an approximate $\chi^2$ distribution. Let us therefore begin by computing the statistic and its associated p-value. (I am using R for these calculations.)
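For reference, the discrepancy in question is the usual Pearson statistic comparing the observed counts $O_{ij}$ with the counts $E_{ij}$ expected when the two columns share a common distribution:

$$X^2 = \sum_{i=1}^{5}\sum_{j=1}^{2} \frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total})\times(\text{column } j \text{ total})}{45},$$

referred to a $\chi^2$ distribution with $(5-1)(2-1)=4$ degrees of freedom.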

# 5 x 2 table of counts: rows are ethnicities, columns the two samples
x <- cbind(A=c(1,3,1,3,15), B=c(2,0,0,8,12))
chisq.test(x)

The output is

X-squared = 6.9206, df = 4, p-value = 0.1401

along with a warning that "Chi-squared approximation may be incorrect." Fair enough. But since the reported p-value is not extreme--so we're not reaching far into the tails of the distribution of the statistic--we can expect this p-value to be fairly accurate. Let's see.

Simulating the Chi-squared P-value

One way to check is to simulate the true distribution of the $\chi^2$ statistic. R offers a "Monte Carlo" test.

# Monte Carlo p-value from 100,000 simulated tables
chisq.test(x, simulate.p.value=TRUE, B=1e5)

Using $100,000$ iterations (and repeating that several times), this test reports a p-value consistently near $0.130$: reasonably close to the original p-value of $0.1401$.

(If I am reading the R source code for chisq.test correctly, in each Monte-Carlo iteration it computes a $\chi^2$ statistic comparing the simulated data to the estimates obtained from the original data (rather than to estimates obtained from the marginals of the simulated data, as is performed in a true $\chi^2$ test). It is difficult to see how this is directly applicable to the original hypothesis. The R manual refers us to Hope, A. C. A. (1968) A simplified Monte Carlo significance test procedure. J. Roy. Statist. Soc. B 30, 582–598. I cannot find in that paper any justification for what R is doing; in particular, the paper uses independent tests of each simulated sample to assess goodness of fit for continuous distributions, whereas the R software conducts a series of dependent tests to assess independence among samples involving discrete distributions.)

Going Deeper

Another approach is to bootstrap the test. This procedure uses the data to estimate the parameters under the null hypothesis (that the two samples are from the same population), then repeatedly replicates the data-collection process by drawing new values according to that distribution. By studying the distribution of $\chi^2$ statistics that arise, we can see where the actual $\chi^2$ statistic fits--and decide whether it is sufficiently extreme to warrant rejection of the null hypothesis.

The row marginals let us estimate the relative proportions of each ethnicity under the null hypothesis: Ethnicity_1 was observed $(2+1)/45$ of the time, etc. Each bootstrap iteration draws two independent samples from this hypothesized distribution, one of size $23$ and another of size $22$, and computes the $\chi^2$ statistic for these two samples.
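In R, continuing the session above (the name p.hat is mine):

p.hat <- rowSums(x) / sum(x)   # 3/45, 3/45, 1/45, 11/45, 27/45
p.hat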

When you try that, you will stumble upon a very interesting phenomenon: because some ethnicities were observed rarely (Ethnicity_4 appears only once in the combined data), in many simulated samples they are not observed at all. This makes it impossible to calculate a $\chi^2$ statistic based on all five ethnicities! (It would require you to divide by zero.) What to do?

  1. You could just compute the $\chi^2$ statistic based on the ethnicities actually observed, even when only three or four different ones appear among the two samples. When I do this with $10,000$ iterations, I obtain a p-value of $0.086$.

  2. You could compute the $\chi^2$ statistic only in those simulations where all five ethnicities were observed. This time I compute a p-value of $0.108$. (Less than $60\%$ of all simulations included all five ethnicities.) A sketch implementing both options appears below.
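Here is a minimal sketch of such a bootstrap in R, using x and p.hat from above and implementing both options. (The helper name chisq.stat is mine, and with $10,000$ iterations the p-values will wobble a little from run to run around the values quoted above.)

set.seed(17)
n.A <- sum(x[, "A"]); n.B <- sum(x[, "B"])    # fixed sample sizes, 23 and 22

# Pearson chi-squared statistic computed from the rows actually observed
chisq.stat <- function(a, b) {
  tbl <- cbind(a, b)
  tbl <- tbl[rowSums(tbl) > 0, , drop=FALSE]
  suppressWarnings(chisq.test(tbl)$statistic)
}

sim <- replicate(1e4, {
  a <- rmultinom(1, n.A, p.hat)[, 1]          # new sample A under the null
  b <- rmultinom(1, n.B, p.hat)[, 1]          # new sample B under the null
  c(stat=as.numeric(chisq.stat(a, b)), all5=all(a + b > 0))
})

stat0 <- suppressWarnings(chisq.test(x)$statistic)
mean(sim["stat", ] >= stat0)                     # option 1: every simulation counts
mean(sim["stat", sim["all5", ] == 1] >= stat0)   # option 2: all five ethnicities seen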

Conclusions

We have obtained a range of p-values from $0.086$ through $0.140$, some more legitimately applicable than others. (The Fisher Exact test p-value of $0.119$, by the way, fits within this range.) If your criterion for a significant result is more stringent than $8.6\%$, there is no problem: you will not reject the null hypothesis and so you needn't worry over which tests really are applicable. But if your criterion lies within this range (such as $10\%$), then your choice of test matters.
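Incidentally, unlike scipy, R's fisher.test does handle tables larger than $2\times 2$, so the exact-test p-value quoted above is easy to reproduce:

fisher.test(x)$p.value    # about 0.119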

As the preceding efforts at simulation showed so clearly, which test to use depends on your application. Do you know that only five ethnicities could have been observed? Or are you tracking only the ethnicities that happened to appear in your samples? From the gap in numbering between 2 and 4 I would guess that Ethnicity_3 might be possible but was not observed. As such, if you choose to use a $\chi^2$ statistic based only on the ethnicities observed, then you are in situation (1) and you should report a p-value of $0.086$. If you had collected the data differently--for example, by augmenting the sample sizes until at least one of each ethnicity appeared in the dataset--then an approach comparable to (2) would be more appropriate. The key is to reproduce faithfully all details of your actual sampling procedure within the simulation so that you obtain an honest representation of the distribution of your test statistic.


Planning Follow-on Studies

It may be worth remarking that even if you view this range of results as being immaterial--you would make the same decision regardless--the choice of test can nevertheless make a big difference if you plan to conduct additional experiments in the hope of demonstrating an effect. Under that assumption, by using a p-value of $0.086$ (and adopting a significance threshold of $0.05$) you would need a dataset approximately $(Z_{0.05}/Z_{0.086})^2 = 1.45$ times as great as the current one, whereas by using a p-value of $0.140$ you would want to collect $(Z_{0.05}/Z_{0.140})^2 = 2.32$ times as much data--about $60\%$ more than in the first case.

(The "$Z_{*}$" are quantiles of a standard Normal distribution, invoked here as a rough approximation to a $\chi^2$ power and sample size analysis. The point is not to do an accurate power analysis, but only to observe that it takes relatively few additional data to lower a p-value that is near $0.05$ to below $0.05$ -- assuming the effect is real! -- compared to the amount of data needed to lower a p-value that is far from $0.05$ to below $0.05$.)
