[Math] Normalization for Chi square test

probability distributions, statistical-inference, statistics

The formula for the Chi-Square test statistic is the following:

$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$

where $O_i$ is the observed value and $E_i$ is the expected value.

I'm curious why it depends on the absolute values. For example, if we change the units of measurement, we get a different statistic. Suppose we're performing a test on apple weights. One of the samples weighs 165 grams, and we expect it to weigh 182 grams. The corresponding term of the formula is then:

$\frac{(165 - 182)^2}{182} \approx 1.58791$

http://en.wikipedia.org/wiki/Pearson's_chi-squared_test

Now suppose we live in a country where precision is a top priority. We use milligrams for everything, so we get the same results in different units: 165000 milligrams and 182000 milligrams, respectively. The term becomes:

$\frac{(165000 - 182000)^2}{182000} \approx 1587.91$
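
The two terms differ by exactly the conversion factor of 1000. In general, multiplying both values by a constant $k$ multiplies the term by $k$:

$$\frac{(kO_i - kE_i)^2}{kE_i} = \frac{k^2 (O_i - E_i)^2}{k E_i} = k \cdot \frac{(O_i - E_i)^2}{E_i}$$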

So our conclusion will differ depending on the units we use. Why? What am I missing, and why are the values not normalized in the chi-squared test?

Best Answer

In the version of this test that I am familiar with, the individual data are categorical, not quantitative as in your examples. The expected and observed values should be frequencies of some category (a count of how many times it occurs), not an individual's quantitative measurement. The numbers that go into the $E_i$ and $O_i$ positions are unitless, as they are just counts.

So for example, in a box of mixed fruit, maybe 12 pieces were bananas, but you were expecting 15 to be bananas. You would have the term $$\frac{(12-15)^2}{15}$$ and there is no way to rescale units as you did. Writing $$\frac{(12000-15000)^2}{15000}$$ would correspond to a very different scenario: you would have seen 12000 bananas when you were expecting 15000. The corresponding $P$-value should be a lot smaller, because being off by 3000 out of 15000 is far less likely than being off by 3 out of 15, given the random variation in whether each individual piece of fruit is a banana. So $\chi^2$ should be a lot larger in the latter case.
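
To make the comparison concrete, here is a minimal Python sketch. It assumes a hypothetical two-category setup (banana vs. other) in which half of the fruit is expected to be bananas, so the box sizes of 30 and 30000 and the helper names `chi_square_stat` and `p_value_df1` are illustrative choices, not part of the example above. With two categories there is one degree of freedom, for which the upper-tail probability is $\operatorname{erfc}(\sqrt{\chi^2/2})$.

```python
import math

def chi_square_stat(observed, expected):
    """Pearson's statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Assumed counts for [banana, other]; 50% bananas expected in both boxes.
small_obs, small_exp = [12, 18], [15, 15]
large_obs, large_exp = [12000, 18000], [15000, 15000]

small_stat = chi_square_stat(small_obs, small_exp)
large_stat = chi_square_stat(large_obs, large_exp)

print("small box:", small_stat, p_value_df1(small_stat))
print("large box:", large_stat, p_value_df1(large_stat))
```

Under these assumptions the observed proportion of bananas is the same in both boxes, yet the statistic grows from 1.2 to 1200 and the $P$-value drops from roughly 0.27 to practically zero. The counts carry the sample size, which is why they cannot be rescaled like units.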