You can turn the question around. Since the ordinary Pearson $\chi^2$ test is almost always more accurate than Fisher's exact test and is much quicker to compute, why does anyone use Fisher's test?
Note that it is a fallacy that the expected cell frequencies must exceed 5 for Pearson's $\chi^2$ to yield accurate $P$-values. The test is accurate even when expected cell frequencies are as low as 1.0, provided the simple $\frac{N-1}{N}$ correction is applied to the test statistic.
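As a concrete sketch of that correction (on a hypothetical $2\times 2$ table, using Python's scipy purely for illustration): the 'N-1' statistic is just Pearson's statistic rescaled by $(N-1)/N$, referred to the same chi-squared reference distribution.

```python
# Sketch of the 'N-1' chi-squared test on a hypothetical 2x2 table.
# The (N-1)/N factor rescales Pearson's statistic; the p-value still
# comes from the chi-squared reference distribution with 1 df.
import numpy as np
from scipy import stats

table = np.array([[7, 3],
                  [2, 8]])            # hypothetical counts
N = table.sum()                       # total sample size

chi2, p_pearson, df, _ = stats.chi2_contingency(table, correction=False)
chi2_n1 = chi2 * (N - 1) / N          # E. Pearson's 'N-1' modification
p_n1 = stats.chi2.sf(chi2_n1, df)     # upper-tail p-value

print(round(chi2, 3), round(chi2_n1, 3), round(p_n1, 4))
```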
From R-help, 2009:
Campbell, I. Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine 2007; 26:3661-3675. (abstract)
- ...the latest edition of Armitage's book recommends that continuity adjustments never be used for contingency table chi-square tests;
- E. Pearson's modification of the Pearson chi-square test differs from the original by a factor of $(N-1)/N$;
- Cochran noted that the number 5 in "expected frequency less than 5" was arbitrary;
- the findings of published studies may be summarized as follows, for comparative trials:
  - Yates' chi-squared test has type I error rates less than the nominal, often less than half the nominal;
  - the Fisher-Irwin test has type I error rates less than the nominal;
  - K. Pearson's version of the chi-squared test has type I error rates closer to the nominal than Yates' chi-squared test and the Fisher-Irwin test, but in some situations gives type I errors appreciably larger than the nominal value;
  - the 'N-1' chi-squared test behaves like K. Pearson's 'N' version, but the tendency toward higher-than-nominal values is reduced;
  - the two-sided Fisher-Irwin test using Irwin's rule is less conservative than the method of doubling the one-sided probability;
  - the mid-$P$ Fisher-Irwin test by doubling the one-sided probability performs better than the standard versions of the Fisher-Irwin test, and the mid-$P$ method by Irwin's rule performs better still in having actual type I errors closer to nominal levels;
- strong support for the 'N-1' test provided all expected frequencies exceed 1;
- a flaw in the Fisher test, which was based on Fisher's premise that marginal totals carry no useful information;
- a demonstration that the marginal totals do carry useful information in very small sample sizes;
- Yates' continuity adjustment of $N/2$ is a large over-correction and is inappropriate;
- counter-arguments exist to the use of randomization tests in randomized trials;
- calculations of worst cases;
- overall recommendation: use the 'N-1' chi-squared test when all expected frequencies are at least 1; otherwise use the Fisher-Irwin test by Irwin's rule for two-sided tests (taking tables from either tail as likely as, or less likely than, that observed); see the letter to the editor by Antonio Andres and the author's reply in 27:1791-1796; 2008.
Crans GG, Shuster JJ. How conservative is Fisher's exact test? A quantitative evaluation of the two-sample comparative binomial trial. Statistics in Medicine 2008; 27:3598-3611. (abstract)
- ...the first paper to truly quantify the conservativeness of Fisher's test;
- "the test size of FET was less than 0.035 for nearly all sample sizes before 50 and did not approach 0.05 even for sample sizes over 100";
- conservativeness of "exact" methods;
- see Stat in Med 28:173-179, 2009 for a criticism, which went unanswered.
Lydersen S, Fagerland MW, Laake P. Recommended tests for association in $2\times 2$ tables. Statistics in Medicine 2009; 28:1159-1175. (abstract)
- ...Fisher's exact test should never be used unless the mid-$P$ correction is applied;
- the value of unconditional tests;
- see the letter to the editor, 30:890-891; 2011.
In a classical hypothesis test, you have a test statistic that orders the evidence from that which is most conducive to the null hypothesis to that which is most conducive to the alternative hypothesis. (Without loss of generality, suppose that a higher value of this statistic is more conducive to the alternative hypothesis.) The p-value of the test is the probability of observing evidence at least as conducive to the alternative hypothesis as what you actually observed (a test statistic at least as large as the observed value) under the assumption that the null hypothesis is true. This is computed from the null distribution of the test statistic, which is its distribution under the assumption that the null hypothesis is true.
Now, an "exact test" is a test that computes the p-value exactly ---i.e., it computes this from the true null distribution of the test statistic. In many statistical tests, the true null distribution is complicated, but it can be approximated by another distribution, and it converges to that approximating distribution as $n \rightarrow \infty$. In particular, the so-called "chi-squared tests" are hypothesis tests where the true null distribution converges to a chi-squared distribution.
So, in a "chi-squared test" of this kind, when you compute the p-value of the test using the chi-squared distribution, this is just an approximation to the true p-value. The true p-value of the test is given by the exact test, and you are approximating this value using the approximating null distribution of the test statistic. When $n$ is large this approximation is very good, but when $n$ is small the approximation may be poor. For this reason, statisticians counsel against using the "chi-squared tests" (i.e., using the chi-squared approximation to the true null distribution) when $n$ is small.
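To make the contrast concrete, here is a small illustration (Python/scipy, on a hypothetical small table): the "exact" p-value is computed from the discrete null distribution, while the chi-squared p-value is the large-$n$ approximation, and at small $n$ the two can differ noticeably.

```python
# Illustration: exact vs. approximate p-values on a small hypothetical table.
import numpy as np
from scipy import stats

table = np.array([[8, 2],
                  [1, 5]])              # small hypothetical 2x2 table

# "Exact" p-value: computed from the true (discrete) null distribution.
_, p_exact = stats.fisher_exact(table)

# Approximate p-value: Pearson statistic referred to chi-squared with 1 df.
chi2, p_approx, _, _ = stats.chi2_contingency(table, correction=False)

print(round(p_exact, 4), round(p_approx, 4))   # the two differ at small n
```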
Chi-squared tests for independence in contingency tables

Now I will examine your specific questions in relation to chi-squared tests for testing independence in contingency tables. In this context, if we have a contingency table with observed counts $O_1,...,O_K$ summing to $n \equiv \sum O_i$, then the test statistic is the Pearson statistic:
$$\chi^2 = \sum_{i=1}^K \frac{(O_i-E_i)^2}{E_i},$$
where $E_1,...,E_K$ are the expected cell values under the null hypothesis.$^\dagger$ The first thing to note here is that the observed counts $O_1,...,O_K$ are non-negative integers. For any $n<\infty$ this limits the possible values of the test statistic to a finite set of possible values, so its true null distribution will be a discrete distribution on this finite set of values. Note that the chi-squared distribution cannot be the true null distribution because it is a continuous distribution over all non-negative real numbers --- an (uncountable) infinite set of values.
As in other "chi-squared tests" the null distribution of the test statistic here is well approximated by the chi-squared distribution when $n$ is large. You are not correct to say that this is a matter of failing to "adequately approximate the theoretical chi-squared distribution" --- on the contrary, the theoretical chi-squared distribution is the approximation, not the true null distribution. The chi-squared approximation is good so long as none of the values $E_1,...,E_K$ is small. The reason that these expected values are small for low values of $n$ is that when you have a low total count value, you must expect the counts in at least some cells to be low.
$^\dagger$ For analysis of contingency tables, these expected cell counts are obtained by conditioning on the marginal totals under the null hypothesis of independence. It is not necessary for us to go into any further detail on these values.
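The finiteness claim is easy to verify by brute force. This sketch (Python, hypothetical margins) enumerates every $2\times 2$ table consistent with fixed margins and collects the resulting Pearson statistics, which form a small finite set:

```python
# With fixed margins, a 2x2 table is determined by a single cell, so the
# Pearson statistic can take only finitely many values (hypothetical margins).
row, col = [5, 5], [4, 6]
N = sum(row)

support = set()
for a in range(max(0, col[0] - row[1]), min(row[0], col[0]) + 1):
    # the full table, in row-major order, determined by cell a and the margins
    cells = [a, row[0] - a, col[0] - a, row[1] - (col[0] - a)]
    exp = [row[i // 2] * col[i % 2] / N for i in range(4)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(cells, exp))
    support.add(round(chi2, 4))

print(sorted(support))  # a finite, discrete set of possible statistic values
```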
Best Answer
The solution depends intimately on how the data were collected and summarized. This answer takes you through a process of thinking about the data, analyzing them, reflecting on the results, and improving the test until some insight is achieved. Along the way we develop and compare five variants of the $\chi^2$ test.
Fisher's test is not applicable because you have two independent samples. Assuming you decided beforehand how large each sample should be, the column counts ("marginals") are indeed fixed, as assumed by that test. But (I presume) you had no predetermined control over the total numbers of each ethnicity that would be observed, so the row counts (their marginals) are not fixed. That is contrary to what Fisher's test assumes.
(Fisher's test would indeed apply if these data had arisen from a single collection of $45$ subjects who were randomly divided by the experimenter into two groups of predetermined sizes $23$ and $22$, as is often done in controlled experiments.)
The Chi-squared Test
In these data the total count is $45$ for $5\times 2=10$ table entries, producing a mean count of $4.5$ spread through two columns of roughly equal totals ($23$ and $22$). This is starting to get into the range where rules of thumb suggest the $\chi^2$ statistic--which is just a number measuring a discrepancy between the two ethnicity distributions--may have an approximate $\chi^2$ distribution. Let us therefore begin by computing the statistic and its associated p-value. (I am using R for these calculations.) The output is a p-value of $0.1401$, along with a warning that "Chi-squared approximation may be incorrect." Fair enough. But since the reported p-value is not extreme--so we're not reaching far into the tails of the distribution of the statistic--we can expect this p-value to be fairly accurate. Let's see.
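The original table is not reproduced in this answer, so as a stand-in here is a hedged sketch (Python/scipy) on a hypothetical $5\times 2$ table with column totals $23$ and $22$; the point is only the mechanics of the test, not the reported numbers.

```python
# Hedged sketch: Pearson chi-squared test on a hypothetical 5x2 table of
# ethnicity-by-sample counts; the original data are not reproduced here.
import numpy as np
from scipy import stats

table = np.array([[ 2,  1],    # column sums are 23 and 22, as in the text
                  [ 1,  2],
                  [ 0,  3],
                  [ 8,  4],
                  [12, 12]])

chi2, p, df, expected = stats.chi2_contingency(table, correction=False)
print(round(chi2, 3), df, round(p, 4))
# Several expected counts are well below 5, which is the kind of table that
# triggers R's "Chi-squared approximation may be incorrect" warning.
```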
Simulating the Chi-squared P-value
One way to check is to simulate the true distribution of the $\chi^2$ statistic. R offers a "Monte Carlo" test. Using $100,000$ iterations (and repeating that several times), this test reports a p-value consistently near $0.130$: reasonably close to the original p-value of $0.1401$.
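In spirit, the Monte Carlo test draws random tables with the observed margins and asks how often the simulated statistic reaches the observed one. A hedged sketch of that idea (Python, on a hypothetical table standing in for the original data) fixes both sets of margins by permuting the group labels:

```python
# Hedged sketch: Monte Carlo (permutation) p-value for a 5x2 table,
# conditioning on both margins by shuffling the group (column) labels.
# The table is hypothetical; the original data are not reproduced here.
import numpy as np

rng = np.random.default_rng(1)
table = np.array([[2, 1], [1, 2], [0, 3], [8, 4], [12, 12]])

def pearson_chi2(t):
    e = np.outer(t.sum(1), t.sum(0)) / t.sum()   # expected counts
    return ((t - e) ** 2 / e).sum()

# Expand the table into one (ethnicity, group) pair per subject.
eth = np.repeat(np.arange(5), table.sum(1))
grp = np.concatenate([np.repeat([0, 1], row) for row in table])

obs = pearson_chi2(table)
hits = 0
B = 2000
for _ in range(B):
    t = np.zeros_like(table)
    np.add.at(t, (eth, rng.permutation(grp)), 1)  # re-tabulate after shuffle
    hits += pearson_chi2(t) >= obs - 1e-9

p_mc = (hits + 1) / (B + 1)                       # Monte Carlo p-value
print(round(p_mc, 3))
```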
(If I am reading the R source code for chisq.test correctly, in each Monte-Carlo iteration it computes a $\chi^2$ statistic comparing the simulated data to the estimates obtained from the original data, rather than to estimates obtained from the marginals of the simulated data, as would be done in a true $\chi^2$ test. It is difficult to see how this is directly applicable to the original hypothesis. The R manual refers us to Hope, A. C. A. (1968), A simplified Monte Carlo significance test procedure, J. Roy. Statist. Soc. B 30, 582-598. I cannot find in that paper any justification for what R is doing; in particular, the paper uses independent tests of each simulated sample to assess goodness of fit for continuous distributions, whereas the R software conducts a series of dependent tests to assess independence among samples involving discrete distributions.)

Going Deeper
Another approach is to bootstrap the test. This procedure uses the data to estimate the parameters under the null hypothesis (that the two samples are from the same population), then repeatedly replicates the data-collection process by drawing new values according to that distribution. By studying the distribution of $\chi^2$ statistics that arise, we can see where the actual $\chi^2$ statistic fits--and decide whether it is sufficiently extreme to warrant rejection of the null hypothesis.
The row marginals let us estimate the relative proportions of each ethnicity under the null hypothesis: Ethnicity_1 was observed $(2+1)/45$ of the time, etc. Each bootstrap iteration draws two independent samples from this hypothesized distribution, one of size $23$ and another of size $22$, and computes the $\chi^2$ statistic for these two samples.

When you try that, you will stumble upon a very interesting phenomenon: because ethnicities 2 and 3 were observed rarely, in many simulated samples they are not observed at all. This makes it impossible to calculate a $\chi^2$ statistic based on all five ethnicities! (It would require you to divide by zero.) What to do?
1. You could just compute the $\chi^2$ statistic based on the ethnicities actually observed, even when only three or four different ones appear among the two samples. When I do this with $10,000$ iterations, I obtain a p-value of $0.086$.

2. You could compute the $\chi^2$ statistic only in those simulations where all five ethnicities were observed. This time I compute a p-value of $0.108$. (Less than $60\%$ of all simulations included all five ethnicities.)
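A hedged sketch of this bootstrap (Python, again on a hypothetical table standing in for the original data) implementing the first option, where the statistic is computed from whichever ethnicities actually appear in each simulated pair of samples:

```python
# Hedged sketch of the bootstrap test: estimate ethnicity proportions from
# the pooled (hypothetical) table, then repeatedly draw two independent
# samples of sizes 23 and 22 and recompute the chi-squared statistic.
import numpy as np

rng = np.random.default_rng(7)
table = np.array([[2, 1], [1, 2], [0, 3], [8, 4], [12, 12]])  # hypothetical
p_hat = table.sum(1) / table.sum()      # estimated null proportions
n1, n2 = table.sum(0)                   # sample sizes 23 and 22

def chi2_observed_rows(t):
    # Option 1: drop ethnicities absent from BOTH samples before computing.
    t = t[t.sum(1) > 0]
    e = np.outer(t.sum(1), t.sum(0)) / t.sum()
    return ((t - e) ** 2 / e).sum()

obs = chi2_observed_rows(table)
stats_boot = []
for _ in range(2000):
    sim = np.column_stack([rng.multinomial(n1, p_hat),
                           rng.multinomial(n2, p_hat)])
    stats_boot.append(chi2_observed_rows(sim))

p_boot = np.mean(np.array(stats_boot) >= obs - 1e-9)
print(round(float(p_boot), 3))
```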
Conclusions
We have obtained a range of p-values from $0.086$ through $0.140$, some more legitimately applicable than others. (The Fisher Exact test p-value of $0.119$, by the way, fits within this range.) If your criterion for a significant result is more stringent than $8.6\%$, there is no problem: you will not reject the null hypothesis and so you needn't worry over which tests really are applicable. But if your criterion lies within this range (such as $10\%$), then your choice of test matters.
As the preceding efforts at simulation showed so clearly, which test to use depends on your application. Do you know that only five ethnicities could have been observed? Or are you tracking only the ethnicities that happened to appear in your samples? From the gap in numbering between 2 and 4 I would guess that Ethnicity_3 might be possible but was not observed. As such, if you choose to use a $\chi^2$ statistic based only on the ethnicities observed, then you are in situation (1) and you should report a p-value of $0.086$. If you had collected the data differently--for example, by augmenting the sample sizes until at least one of each ethnicity appeared in the dataset--then an approach comparable to (2) would be more appropriate. The key is to reproduce faithfully all details of your actual sampling procedure within the simulation so that you obtain an honest representation of the distribution of your test statistic.

Planning Follow-on Studies
It may be worth remarking that even if you view this range of results as immaterial--you would make the same decision regardless--the choice of test can nevertheless make a big difference if you plan to conduct additional experiments in the hope of demonstrating an effect. In that case, working from a p-value of $0.086$ (and adopting a significance threshold of $0.05$) you would need a dataset approximately $(Z_{0.05}/Z_{0.086})^2 = 1.45$ times as large as the current one, whereas working from a p-value of $0.140$ you would want to collect $2.32$ times as much data, which will cost about $60\%$ more to do.
(The "$Z_{*}$" are quantiles of a standard Normal distribution, invoked here as a rough approximation to a $\chi^2$ power and sample size analysis. The point is not to do an accurate power analysis, but only to observe that it takes relatively few additional data to lower a p-value that is near $0.05$ to below $0.05$ -- assuming the effect is real! -- compared to the amount of data needed to lower a p-value that is far from $0.05$ to below $0.05$.)
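Those multipliers are quick to check with Normal quantiles (a sketch in Python; here $Z_p$ denotes the upper-tail quantile):

```python
# Verifying the rough sample-size multipliers quoted above.
from scipy.stats import norm

def z(p):
    return norm.isf(p)               # upper-tail Normal quantile Z_p

r_086 = (z(0.05) / z(0.086)) ** 2    # data multiplier starting from p = 0.086
r_140 = (z(0.05) / z(0.140)) ** 2    # data multiplier starting from p = 0.140

print(round(r_086, 2), round(r_140, 2), round(r_140 / r_086, 2))
```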