Solved – Excel CHISQ.TEST

chi-squared-testexcel

In Excel, if I get a value that is less than alpha (i.e. I get a value near zero) for the CHISQ.TEST function, does this mean that the my observed data follows the expected distribution or is it the other way around?

Best Answer

Microsoft says

CHISQ.TEST returns the probability that a value of the $\chi^2$ statistic at least as high as the value calculated by the above formula could have happened by chance under the assumption of independence.

This means that CHISQ.TEST is measuring the upper tail, and so a value near zero from CHISQ.TEST means that the $\chi^2$ statistic is relatively large, with the observed values not close to the expected values. The older CHITEST was similar.

Related Solutions

Solved – Strange way of calculating chi-squared in Excel vs R

This turns out to be quite straightforward.

This is clearly binomial sampling. There are two ways to look at it.

Method 1, that of the spreadsheet, it to treat the observed counts $X_i$ as $\sim \text{Bin}(N_i,p_i)$, which may be approximated as $\text{N}(\mu_i=N_i\cdot p_i,\sigma_i^2=N_i\cdot p_i(1-p_i))$. As such, $Z_i=(X_i-\mu_i)/\sigma_i$ are approximately standard normal, and the $Z$'s are independent, so (approximately) $\sum_i Z_i^2\sim \chi^2$.

(If the p's are based off observed counts, then the $Z$'s aren't independent, but it's still chi-square with one fewer degree of freedom.)

Method 2: your use of the $(O-E)^2/E$ form of chi-square also works, but it requires that you take account not only of those in the category you have labelled 'Observed' but also those not in that category:

+------------+------+-------+
| Population | In A | Not A |
+------------+------+-------+
|       2000 |   42 |  1958 |
|       2000 |   42 |  1958 |
|       2000 |   25 |  1975 |
|       2000 |   21 |  1979 |
+ -----------+------+-------+

Where the $E$'s for the first column are as you have them, and those for the second column are $N_i(1-p_i)$

... and then sum $(O-E)^2/E$ over both columns.

The two forms are algebraically equivalent. Note that $1/p + 1/(1-p) = 1/p(1-p)$. Consider the i$^{th}$ row of the chi-square:

\begin{eqnarray} \frac{(X_i - \mu_i)^2}{\sigma_i^2} &=& \frac{(X_i- N_ip_i)^2}{N_ip_i(1-p_i)}\\ &=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{(X_i- N_ip_i)^2}{N_i(1-p_i)}\\ &=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{(N_i-N_i+N_ip_i-X_i)^2}{N_i(1-p_i)}\\ &=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{(N_i-X_i-(N_i-N_ip_i))^2}{N_i(1-p_i)}\\ &=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{((N_i-X_i)-N_i(1-p_i))^2}{N_i(1-p_i)}\\ &=& \frac{(O^{(A)}_i- E^{(A)}_i)^2}{E^{(A)}_i} +\frac{(O^{(\bar A)}_i-E^{(\bar A)}_i)^2}{E^{(\bar A)}_i} \end{eqnarray}

Which means you should get the same answer both ways, up to rounding error.

Let's see:

             Observed             Expected                 (O-E)^2/E          
  Ni        A     not A          A      not A             A           not A      
 2000     42         1958      32.5     1967.5       2.776923077     0.045870394     
 2000     42         1958      32.5     1967.5       2.776923077     0.045870394     
 2000     25         1975      32.5     1967.5       1.730769231     0.028589581     
 2000     21         1979      32.5     1967.5       4.069230769     0.067217281     

                                            Sum     11.35384615      0.187547649

Chi-square = 11.353846 + 0.187548 = 11.54139

Which matches their answer.

Solved – Exact formula Yates’ correction in R

The problem was the absolute value, as @Scortchi noted.

Yates' correction modifies the $\chi^2$ statistic for a $2\times 2$ contingency table in an effort to correct the error made by using a (continuous) $\chi^2$ distribution to approximate the (discrete) sampling distribution of the statistic.

Recall that the $\chi^2$ statistic is based on the residuals in a contingency table: the differences between the observed counts $O$ and the expectations $E$ in each cell. (The expectations do not have to be whole numbers). In fact, only the sizes of the residuals really matter, because the residuals are always squared. Yates' correction subtracts $1/2$ from the size of each residual. Thus, the original formula

$$\chi^2 = \sum_{\text{cells}} \frac{(O_\text{cell} - E_\text{cell})^2}{E_\text{cell}}$$

becomes

$$\chi^2_\text{corrected} = \sum_{\text{cells}} \frac{(|O_\text{cell} - E_\text{cell}| - 1/2)^2}{E_\text{cell}}.$$

The R code for chisq.test appears to be a little subtler. Here is the relevant section. (It is buried within some nested conditionals which are not relevant here.)

        if (correct && nrow(x) == 2L && ncol(x) == 2L) {
            YATES <- min(0.5, abs(x - E))
            if (YATES > 0) 
              METHOD <- paste(METHOD, "with Yates' continuity correction")
        }
        else YATES <- 0
        STATISTIC <- sum((abs(x - E) - YATES)^2/E)

In this code, x stores the cell counts (thus playing the role of $O$) and E is a parallel array of expected values. The outer conditional (if) assures the correction is applied only when (a) it is requested, as indicated by the logical value of correct, and (b) these counts are for a $2\times 2$ table.

The use of min replaces $1/2$ in the correction by the smallest of the absolute residuals (should any of them be smaller than $1/2$). This assures that none of the corrected absolute residuals is made any less than zero. This little nicety is not noted in the Wikipedia article. Although not the same as Yates' original proposal, it can be construed as a variation of it in which no corrected value is ever made negative:

... group the $\chi$ distribution, taking the half units of deviation from expectation as the group boundaries ... . This is equivalent to computing the values of $\chi^2$ for deviations half a unit less than the true deviations, $8$ successes, for example, being reckoned as $7\frac{1}{2}$... . This correction may be styled the correction for continuity... .

Reference

The quotation is at p. 222 of

Yates, F (1934). "Contingency table involving small numbers and the χ2 test". Supplement to the Journal of the Royal Statistical Society 1(2): 217–235.

Best Answer

Related Solutions

Solved – Strange way of calculating chi-squared in Excel vs R

Solved – Exact formula Yates’ correction in R

Reference

Related Question