In Excel, if I get a value that is less than alpha (i.e. I get a value near zero) for the CHISQ.TEST function, does this mean that the my observed data follows the expected distribution or is it the other way around?
Solved – Excel CHISQ.TEST
chi-squared-testexcel
Related Solutions
This turns out to be quite straightforward.
This is clearly binomial sampling. There are two ways to look at it.
Method 1, that of the spreadsheet, it to treat the observed counts $X_i$ as $\sim \text{Bin}(N_i,p_i)$, which may be approximated as $\text{N}(\mu_i=N_i\cdot p_i,\sigma_i^2=N_i\cdot p_i(1-p_i))$. As such, $Z_i=(X_i-\mu_i)/\sigma_i$ are approximately standard normal, and the $Z$'s are independent, so (approximately) $\sum_i Z_i^2\sim \chi^2$.
(If the p's are based off observed counts, then the $Z$'s aren't independent, but it's still chi-square with one fewer degree of freedom.)
Method 2: your use of the $(O-E)^2/E$ form of chi-square also works, but it requires that you take account not only of those in the category you have labelled 'Observed' but also those not in that category:
+------------+------+-------+
| Population | In A | Not A |
+------------+------+-------+
| 2000 | 42 | 1958 |
| 2000 | 42 | 1958 |
| 2000 | 25 | 1975 |
| 2000 | 21 | 1979 |
+ -----------+------+-------+
Where the $E$'s for the first column are as you have them, and those for the second column are $N_i(1-p_i)$
... and then sum $(O-E)^2/E$ over both columns.
The two forms are algebraically equivalent. Note that $1/p + 1/(1-p) = 1/p(1-p)$. Consider the i$^{th}$ row of the chi-square:
\begin{eqnarray} \frac{(X_i - \mu_i)^2}{\sigma_i^2} &=& \frac{(X_i- N_ip_i)^2}{N_ip_i(1-p_i)}\\ &=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{(X_i- N_ip_i)^2}{N_i(1-p_i)}\\ &=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{(N_i-N_i+N_ip_i-X_i)^2}{N_i(1-p_i)}\\ &=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{(N_i-X_i-(N_i-N_ip_i))^2}{N_i(1-p_i)}\\ &=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{((N_i-X_i)-N_i(1-p_i))^2}{N_i(1-p_i)}\\ &=& \frac{(O^{(A)}_i- E^{(A)}_i)^2}{E^{(A)}_i} +\frac{(O^{(\bar A)}_i-E^{(\bar A)}_i)^2}{E^{(\bar A)}_i} \end{eqnarray}
Which means you should get the same answer both ways, up to rounding error.
Let's see:
Observed Expected (O-E)^2/E
Ni A not A A not A A not A
2000 42 1958 32.5 1967.5 2.776923077 0.045870394
2000 42 1958 32.5 1967.5 2.776923077 0.045870394
2000 25 1975 32.5 1967.5 1.730769231 0.028589581
2000 21 1979 32.5 1967.5 4.069230769 0.067217281
Sum 11.35384615 0.187547649
Chi-square = 11.353846 + 0.187548 = 11.54139
Which matches their answer.
The problem was the absolute value, as @Scortchi noted.
Yates' correction modifies the $\chi^2$ statistic for a $2\times 2$ contingency table in an effort to correct the error made by using a (continuous) $\chi^2$ distribution to approximate the (discrete) sampling distribution of the statistic.
Recall that the $\chi^2$ statistic is based on the residuals in a contingency table: the differences between the observed counts $O$ and the expectations $E$ in each cell. (The expectations do not have to be whole numbers). In fact, only the sizes of the residuals really matter, because the residuals are always squared. Yates' correction subtracts $1/2$ from the size of each residual. Thus, the original formula
$$\chi^2 = \sum_{\text{cells}} \frac{(O_\text{cell} - E_\text{cell})^2}{E_\text{cell}}$$
becomes
$$\chi^2_\text{corrected} = \sum_{\text{cells}} \frac{(|O_\text{cell} - E_\text{cell}| - 1/2)^2}{E_\text{cell}}.$$
The R
code for chisq.test
appears to be a little subtler. Here is the relevant section. (It is buried within some nested conditionals which are not relevant here.)
if (correct && nrow(x) == 2L && ncol(x) == 2L) {
YATES <- min(0.5, abs(x - E))
if (YATES > 0)
METHOD <- paste(METHOD, "with Yates' continuity correction")
}
else YATES <- 0
STATISTIC <- sum((abs(x - E) - YATES)^2/E)
In this code, x
stores the cell counts (thus playing the role of $O$) and E
is a parallel array of expected values. The outer conditional (if
) assures the correction is applied only when (a) it is requested, as indicated by the logical value of correct
, and (b) these counts are for a $2\times 2$ table.
The use of min
replaces $1/2$ in the correction by the smallest of the absolute residuals (should any of them be smaller than $1/2$). This assures that none of the corrected absolute residuals is made any less than zero. This little nicety is not noted in the Wikipedia article. Although not the same as Yates' original proposal, it can be construed as a variation of it in which no corrected value is ever made negative:
... group the $\chi$ distribution, taking the half units of deviation from expectation as the group boundaries ... . This is equivalent to computing the values of $\chi^2$ for deviations half a unit less than the true deviations, $8$ successes, for example, being reckoned as $7\frac{1}{2}$... . This correction may be styled the correction for continuity... .
Reference
The quotation is at p. 222 of
Yates, F (1934). "Contingency table involving small numbers and the χ2 test". Supplement to the Journal of the Royal Statistical Society 1(2): 217–235.
Best Answer
Microsoft says
This means that
CHISQ.TEST
is measuring the upper tail, and so a value near zero fromCHISQ.TEST
means that the $\chi^2$ statistic is relatively large, with the observed values not close to the expected values. The olderCHITEST
was similar.