Solved – Strange way of calculating chi-squared in Excel vs R

chi-squared-test excel r

I'm looking at an Excel sheet which claims to be calculating $\chi^2$, but I don't recognise this way of doing it, and I was wondering if I'm missing something.

Here is the data it is analysing:

+------------------+----------+----------+
| Total Population | Observed | Expected |
+------------------+----------+----------+
|             2000 |       42 | 32.5     |
|             2000 |       42 | 32.5     |
|             2000 |       25 | 32.5     |
|             2000 |       21 | 32.5     |
+------------------+----------+----------+

And here are the sums it does for each group in order to calculate chi square:

P = (sum of all observed)/(sum of total population) = 0.01625
A = (Observed - (Population * P)) ^2
B = Total Population * P * (1-P)
ChiSq = A/B

So for each group the $\chi^2$ is:

2.822793
2.822793
1.759359
4.136448

And the total Chi Square is: 11.54139.

However, every example I have seen of calculating the $\chi^2$ is completely different from this. I would do for each group:

chiSq = (Observed-Expected)^2 / Expected

And therefore for the example above I would get a total chi square value of 11.3538.
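For reference, both calculations can be reproduced in a few lines of R. This is only a sketch of the formulas as described above; the variable names (`n`, `observed`, `sheet_chisq`, `naive_chisq`) are illustrative and not taken from the spreadsheet.

```r
# Data from the table above
n        <- rep(2000, 4)          # total population per group
observed <- c(42, 42, 25, 21)     # observed count per group

# Pooled proportion, as in the spreadsheet: sum(observed) / sum(population)
p <- sum(observed) / sum(n)

# Spreadsheet version: (O - N*P)^2 / (N*P*(1-P)) for each group
sheet_chisq <- (observed - n * p)^2 / (n * p * (1 - p))

# (O - E)^2 / E using only the 'Observed' column
expected    <- n * p
naive_chisq <- (observed - expected)^2 / expected

sum(sheet_chisq)  # about 11.54139
sum(naive_chisq)  # about 11.35385
```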

My question is – why is the Excel sheet calculating $\chi^2$ in this way? Is this a recognised approach?

UPDATE

My reason for wanting to know this is that I am trying to replicate these results in the R language. I am using the chisq.test function and it is not coming out with the same number as the Excel sheet. So if anyone knows how to do this approach in R it would be very helpful!

UPDATE 2

If anyone's interested, here's how I calculated it in R:

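# 2 x 4 table, filled column by column: row 1 = not in category, row 2 = observed; one column per group
res <- matrix(c((2000-42), 42, (2000-42), 42, (2000-25), 25, (2000-21), 21), 2, 4)
chisq.test(res)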

Best Answer

This turns out to be quite straightforward.

This is clearly binomial sampling. There are two ways to look at it.

Method 1, that of the spreadsheet, is to treat the observed counts $X_i$ as $\sim \text{Bin}(N_i,p_i)$, which may be approximated as $\text{N}(\mu_i=N_i\cdot p_i,\sigma_i^2=N_i\cdot p_i(1-p_i))$. As such, $Z_i=(X_i-\mu_i)/\sigma_i$ are approximately standard normal, and the $Z$'s are independent, so (approximately) $\sum_i Z_i^2\sim \chi^2$.

(If the $p$'s are estimated from the observed counts, as with the pooled $P$ here, then the $Z$'s aren't independent, but the sum is still approximately chi-square, with one fewer degree of freedom.)
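As a rough sketch of Method 1 in R, assuming a single pooled proportion estimated from the data (so that the degrees of freedom are the number of groups minus one), the $Z_i$ and the resulting p-value could be computed as follows. The names `p_hat`, `z`, `chisq`, and `p_value` are illustrative.

```r
n <- rep(2000, 4)           # N_i: population per group
x <- c(42, 42, 25, 21)      # X_i: observed counts

# Assumption: one common p, estimated by pooling all groups
p_hat <- sum(x) / sum(n)

# Normal approximation to Bin(N_i, p): mean and standard deviation
mu    <- n * p_hat
sigma <- sqrt(n * p_hat * (1 - p_hat))

# Standardised counts and their sum of squares
z     <- (x - mu) / sigma
chisq <- sum(z^2)

# p was estimated from the data, so one degree of freedom is lost
df      <- length(x) - 1
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
```

Here `chisq` is the same total as the spreadsheet's, since each $Z_i^2$ is exactly its `A/B` term.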

Method 2: your use of the $(O-E)^2/E$ form of chi-square also works, but it requires that you take account not only of those in the category you have labelled 'Observed' but also those not in that category:

+------------+------+-------+
| Population | In A | Not A |
+------------+------+-------+
|       2000 |   42 |  1958 |
|       2000 |   42 |  1958 |
|       2000 |   25 |  1975 |
|       2000 |   21 |  1979 |
+------------+------+-------+

The $E$'s for the first column are as you have them, and those for the second column are $N_i(1-p_i)$.

Then sum $(O-E)^2/E$ over both columns.

The two forms are algebraically equivalent. Note that $1/p + 1/(1-p) = 1/(p(1-p))$. Consider the $i$th row of the chi-square:

\begin{eqnarray}
\frac{(X_i - \mu_i)^2}{\sigma_i^2} &=& \frac{(X_i- N_ip_i)^2}{N_ip_i(1-p_i)}\\
&=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{(X_i- N_ip_i)^2}{N_i(1-p_i)}\\
&=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{(N_i-N_i+N_ip_i-X_i)^2}{N_i(1-p_i)}\\
&=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{(N_i-X_i-(N_i-N_ip_i))^2}{N_i(1-p_i)}\\
&=& \frac{(X_i- N_ip_i)^2}{N_ip_i} +\frac{((N_i-X_i)-N_i(1-p_i))^2}{N_i(1-p_i)}\\
&=& \frac{(O^{(A)}_i- E^{(A)}_i)^2}{E^{(A)}_i} +\frac{(O^{(\bar A)}_i-E^{(\bar A)}_i)^2}{E^{(\bar A)}_i}
\end{eqnarray}
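The row-by-row identity can also be checked numerically. The following R sketch evaluates the first and last lines of the derivation for the data in the question, using the pooled proportion for every $p_i$; the names `lhs`, `term_a`, `term_not_a`, and `rhs` are illustrative.

```r
n <- rep(2000, 4)           # N_i
x <- c(42, 42, 25, 21)      # X_i, counts in category A
p <- sum(x) / sum(n)        # pooled proportion, used for every p_i

# Left-hand side: (X_i - N_i p)^2 / (N_i p (1 - p))
lhs <- (x - n * p)^2 / (n * p * (1 - p))

# Right-hand side: (O - E)^2 / E for 'A' plus the same for 'not A'
term_a     <- (x - n * p)^2 / (n * p)
term_not_a <- ((n - x) - n * (1 - p))^2 / (n * (1 - p))
rhs        <- term_a + term_not_a

# Both comparisons should report TRUE (up to floating-point tolerance)
all.equal(lhs, rhs)
all.equal(1 / p + 1 / (1 - p), 1 / (p * (1 - p)))
```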

This means you should get the same answer both ways, up to rounding error.

Let's see:

               Observed           Expected              (O-E)^2/E
  Ni         A     not A        A      not A          A             not A
 2000       42      1958      32.5    1967.5     2.776923077    0.045870394
 2000       42      1958      32.5    1967.5     2.776923077    0.045870394
 2000       25      1975      32.5    1967.5     1.730769231    0.028589581
 2000       21      1979      32.5    1967.5     4.069230769    0.067217281

                                          Sum   11.35384615     0.187547649

Chi-square = 11.353846 + 0.187548 = 11.54139

Which matches their answer.
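To reproduce this in R, one option is to pass the full two-column table to `chisq.test`, which is essentially what the code in the question's second update does with the table transposed. The sketch below assumes the default settings (the continuity correction in `chisq.test` applies only to 2 by 2 tables, so it has no effect here); the column names `A` and `not_A` are illustrative.

```r
n <- rep(2000, 4)           # population per group
x <- c(42, 42, 25, 21)      # observed counts in category A

# 4 x 2 contingency table: one row per group, columns 'A' and 'not A'
tab <- cbind(A = x, not_A = n - x)

fit <- chisq.test(tab)

fit$statistic   # Pearson chi-squared, about 11.54139
fit$parameter   # degrees of freedom: 3
fit$expected    # 32.5 and 1967.5 in every row
```

Passing only the `Observed` column would instead run a goodness-of-fit test on those four counts, which corresponds to the 11.3538 figure rather than the spreadsheet's total.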