Solved – Exact formula for Yates’ correction in R

chi-squared-test, r, yates-correction

If I run a chi-squared test in R with Yates' correction, I get slightly different results from computing it by hand. What exact formula does R use for Yates' correction? I use this simple code:

chisq.test(table)

(for a 2×2 table, so df = 1 and R does Yates' correction automatically)

By hand (well, in Excel), I subtract 0.5 from each Observed - Expected value, square the result, and divide by the expected count, so that the statistic is
the sum over all cells of ((O-E)-0.5)^2/E.

This is clearly not the formula R uses to get the chi-squared statistic with Yates' correction, but I can't find what it does use. Does anyone know?

Adding data from SO post:
Some sample data (in table saved as .csv):

Pre; Post
32; 3
512; 179

With R and Yates' correction: chi-squared = 4.4 and p = 0.035

With my own formula, subtracting 0.5: chi-squared = 5.78 and p = 0.0209
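For anyone wanting to reproduce the R figures without the .csv file, here is a minimal sketch that enters the counts directly. The object name tab and the row labels "A" and "B" are assumptions made for illustration; the question does not name the rows.

```r
# Counts from the question, entered by column: Pre = (32, 512), Post = (3, 179).
# The row labels are hypothetical; the question does not name the rows.
tab <- matrix(c(32, 512, 3, 179), nrow = 2,
              dimnames = list(Row = c("A", "B"), Period = c("Pre", "Post")))

with_yates    <- chisq.test(tab)                   # correct = TRUE is the default
without_yates <- chisq.test(tab, correct = FALSE)  # no continuity correction

unname(with_yates$statistic)     # about 4.4, matching the value reported above
with_yates$p.value               # about 0.035
unname(without_yates$statistic)  # uncorrected statistic, for comparison
```

Setting correct = FALSE gives the uncorrected statistic, which is a convenient baseline when checking a hand calculation.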

Best Answer

The problem was the missing absolute value, as @Scortchi noted: the 0.5 must be subtracted from $|O - E|$, not from the signed difference $O - E$.


Yates' correction modifies the $\chi^2$ statistic for a $2\times 2$ contingency table in an effort to correct the error made by using a (continuous) $\chi^2$ distribution to approximate the (discrete) sampling distribution of the statistic.

Recall that the $\chi^2$ statistic is based on the residuals in a contingency table: the differences between the observed counts $O$ and the expectations $E$ in each cell. (The expectations do not have to be whole numbers.) In fact, only the sizes of the residuals really matter, because the residuals are always squared. Yates' correction subtracts $1/2$ from the size of each residual. Thus, the original formula

$$\chi^2 = \sum_{\text{cells}} \frac{(O_\text{cell} - E_\text{cell})^2}{E_\text{cell}}$$

becomes

$$\chi^2_\text{corrected} = \sum_{\text{cells}} \frac{(|O_\text{cell} - E_\text{cell}| - 1/2)^2}{E_\text{cell}}.$$
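As a rough check, the corrected formula can be evaluated by hand in R on the counts from the question, alongside the signed version that omits the absolute value. This is only a sketch; the names O, E, chi2_abs and chi2_signed are chosen here for illustration.

```r
# Counts from the question (columns Pre and Post).
O <- matrix(c(32, 512, 3, 179), nrow = 2)

# Expected counts under independence: (row total * column total) / grand total.
E <- outer(rowSums(O), colSums(O)) / sum(O)

# Yates' correction: subtract 1/2 from the absolute residuals.
chi2_abs <- sum((abs(O - E) - 0.5)^2 / E)

# The signed version from the question: subtracting 1/2 from a negative
# residual makes it larger in magnitude, which inflates the statistic.
chi2_signed <- sum(((O - E) - 0.5)^2 / E)

chi2_abs     # about 4.45, agreeing with chisq.test(O)$statistic
chi2_signed  # about 5.78, the value obtained in the question
```

In a $2\times 2$ table two of the four residuals are negative, so the signed version shrinks two residuals and enlarges the other two, which accounts for the discrepancy.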


The R code for chisq.test appears to be a little subtler. Here is the relevant section. (It is buried within some nested conditionals which are not relevant here.)

        if (correct && nrow(x) == 2L && ncol(x) == 2L) {
            YATES <- min(0.5, abs(x - E))
            if (YATES > 0) 
              METHOD <- paste(METHOD, "with Yates' continuity correction")
        }
        else YATES <- 0
        STATISTIC <- sum((abs(x - E) - YATES)^2/E)

In this code, x stores the cell counts (thus playing the role of $O$) and E is a parallel array of expected values. The outer conditional (if) ensures the correction is applied only when (a) it is requested, as indicated by the logical value of correct, and (b) the counts are for a $2\times 2$ table.

The use of min replaces $1/2$ in the correction by the smallest of the absolute residuals whenever any of them is smaller than $1/2$. This ensures that no corrected absolute residual is made less than zero. This little nicety is not noted in the Wikipedia article. Although it is not the same as Yates' original proposal, it can be construed as a variation of it in which no corrected value is ever made negative. Yates described the correction as follows:
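The effect of min only shows up when the absolute residuals fall below $1/2$. The sketch below uses a made-up table (not from the question) to compare the textbook formula with the capped form from the excerpt above; the names textbook, yates and capped are illustrative.

```r
# A hypothetical table whose absolute residuals are all below 1/2.
x <- matrix(c(10, 10, 10, 11), nrow = 2)
E <- outer(rowSums(x), colSums(x)) / sum(x)
max(abs(x - E))   # about 0.24

# Textbook form: always subtract 1/2, so the residuals overshoot zero
# and contribute a small positive amount once squared.
textbook <- sum((abs(x - E) - 0.5)^2 / E)

# Capped form, as in the chisq.test excerpt: the correction never exceeds
# the smallest absolute residual.
yates  <- min(0.5, abs(x - E))
capped <- sum((abs(x - E) - yates)^2 / E)

textbook                          # small but positive
capped                            # zero, up to floating-point rounding
unname(chisq.test(x)$statistic)   # should agree with capped
```

For tables like the one in the question, where every absolute residual exceeds $1/2$, the two forms coincide.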

"... group the $\chi$ distribution, taking the half units of deviation from expectation as the group boundaries ... . This is equivalent to computing the values of $\chi^2$ for deviations half a unit less than the true deviations, $8$ successes, for example, being reckoned as $7\frac{1}{2}$... . This correction may be styled the correction for continuity... ."

Reference

The quotation is at p. 222 of

Yates, F. (1934). "Contingency tables involving small numbers and the $\chi^2$ test". Supplement to the Journal of the Royal Statistical Society 1(2): 217–235.
