Machine Learning – Why Exact Tests Are Preferred Over Chi-Squared for Small Sample Sizes

chi-squared-test, distributions, machine-learning, mathematical-statistics, statistical-significance

I am aware that tests such as Fisher's exact test are sometimes preferable to the chi-squared test when the expected values in a contingency table are low, for example when testing homogeneity of groups (historically a cutoff of 5 has been suggested, although some consider this conservative).

However, I can't seem to find an explanation of why chi-squared does not work well for small sample sizes. I therefore have two questions:

  1. What causes expected values in a contingency table to become small as sample size reduces? (I am assuming here that the small expected values are a result of the small sample size.)
  2. Why is it that the chi-squared test should not be used for small sample sizes? I have seen people say it does not adequately approximate the theoretical chi-squared distribution, but can someone explain why/how it doesn't?

Best Answer

In a classical hypothesis test, you have a test statistic that orders the evidence from that which is most conducive to the null hypothesis to that which is most conducive to the alternative hypothesis. (Without loss of generality, suppose that a higher value of this statistic is more conducive to the alternative hypothesis.) The p-value of the test is the probability of observing evidence at least as conducive to the alternative hypothesis as what you actually observed (a test statistic at least as large as the observed value) under the assumption that the null hypothesis is true. This is computed from the null distribution of the test statistic, which is its distribution under the assumption that the null hypothesis is true.

Now, an "exact test" is a test that computes the p-value exactly ---i.e., it computes this from the true null distribution of the test statistic. In many statistical tests, the true null distribution is complicated, but it can be approximated by another distribution, and it converges to that approximating distribution as $n \rightarrow \infty$. In particular, the so-called "chi-squared tests" are hypothesis tests where the true null distribution converges to a chi-squared distribution.

So, in a "chi-squared test" of this kind, when you compute the p-value of the test using the chi-squared distribution, this is just an approximation to the true p-value. The true p-value of the test is given by the exact test, and you are approximating this value using the approximating null distribution of the test statistic. When $n$ is large this approximation is very good, but when $n$ is small the approximation may be poor. For this reason, statisticians counsel against using the "chi-squared tests" (i.e., using the chi-squared approximation to the true null distribution) when $n$ is small.
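To make this concrete, here is a minimal sketch using SciPy (the 2x2 table itself is invented purely for illustration) comparing the exact p-value from Fisher's exact test with the chi-squared approximation on a small sample:

```python
from scipy.stats import chi2_contingency, fisher_exact

# A small 2x2 contingency table (n = 8), invented for illustration
table = [[3, 1],
         [1, 3]]

# Exact p-value: computed from the true (discrete) null distribution
_, p_exact = fisher_exact(table)

# Approximate p-value: Pearson statistic referred to the chi-squared
# distribution (no continuity correction, to show the raw approximation)
chi2_stat, p_approx, dof, expected = chi2_contingency(table, correction=False)

print(f"exact p-value  : {p_exact:.4f}")   # ~0.4857
print(f"approx p-value : {p_approx:.4f}")  # ~0.1573
```

With only eight observations the two p-values differ substantially; as $n$ grows (holding the cell proportions fixed) the approximate p-value converges to the exact one.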


Chi-squared tests for independence in contingency tables: Now I will examine your specific questions in relation to chi-squared tests for testing independence in contingency tables. In this context, if we have a contingency table with observed counts $O_1,...,O_K$ summing to $n \equiv \sum O_i$ then the test statistic is the Pearson statistic:

$$\chi^2 = \sum_{i=1}^K \frac{(O_i-E_i)^2}{E_i},$$

where $E_1,...,E_K$ are the expected cell values under the null hypothesis.$^\dagger$ The first thing to note here is that the observed counts $O_1,...,O_K$ are non-negative integers. For any $n<\infty$ this limits the possible values of the test statistic to a finite set of possible values, so its true null distribution will be a discrete distribution on this finite set of values. Note that the chi-squared distribution cannot be the true null distribution because it is a continuous distribution over all non-negative real numbers --- an uncountably infinite set of values.

As in other "chi-squared tests" the null distribution of the test statistic here is well approximated by the chi-squared distribution when $n$ is large. You are not correct to say that this is a matter of failing to "adequately approximate the theoretical chi-squared distribution" --- on the contrary, the theoretical chi-squared distribution is the approximation, not the true null distribution. The chi-squared approximation is good so long as none of the values $E_1,...,E_K$ is small. The reason these expected values are small for low values of $n$ is that the expected cell counts sum to $n$, so when the total count is low, at least some of the cells must have low expected counts.
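Concretely, under independence the expected count for cell $(i,j)$ is $E_{ij} = (\text{row}_i \times \text{col}_j)/n$, so the expected counts scale linearly with $n$ when the cell proportions are held fixed. A short sketch (the proportions are invented for illustration):

```python
def expected_counts(table):
    """Expected cell counts under independence: E_ij = row_i * col_j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Same cell proportions, two sample sizes (proportions invented for illustration)
large = [[60, 40], [30, 70]]   # n = 200
small = [[6, 4], [3, 7]]       # n = 20

print(expected_counts(large))  # [[45.0, 55.0], [45.0, 55.0]]
print(expected_counts(small))  # [[4.5, 5.5], [4.5, 5.5]]
```

At $n = 200$ every expected count is comfortably large, but dividing every cell by 10 drives some expected counts below the classical cutoff of 5, which is exactly when the chi-squared approximation starts to deteriorate.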


$^\dagger$ For analysis of contingency tables, these expected cell counts are obtained by conditioning on the marginal totals under the null hypothesis of independence. It is not necessary for us to go into any further detail on these values.
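For the curious: in the 2x2 case, conditioning on the marginal totals makes the exact null distribution of one cell count hypergeometric, and the two-sided Fisher p-value can be computed directly from it. A sketch using SciPy's hypergeometric distribution (the table is invented for illustration):

```python
from scipy.stats import hypergeom, fisher_exact

# Illustrative 2x2 table (invented for this sketch)
a, b, c, d = 3, 1, 1, 3
n = a + b + c + d                      # grand total
row1, col1 = a + b, a + c              # fixed marginal totals

# Conditional on the margins, the top-left cell count follows a
# hypergeometric distribution: draw col1 items from a population of n,
# of which row1 are "successes".
support = range(max(0, row1 + col1 - n), min(row1, col1) + 1)
pmf = {k: hypergeom.pmf(k, n, row1, col1) for k in support}

# Two-sided Fisher p-value: sum the probabilities of all tables at
# least as extreme (i.e., no more probable) than the observed one.
p_obs = pmf[a]
p_value = sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-9))

print(f"manual exact p-value : {p_value:.6f}")
print(f"scipy fisher_exact   : {fisher_exact([[a, b], [c, d]])[1]:.6f}")
```

The manual computation agrees with `scipy.stats.fisher_exact`, which performs exactly this conditional calculation.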