[Math] Why do we divide by the expected value in the chi squared test

chi squared, statistics

Chi squared, $\chi ^{2}$, is calculated using the formula:

$$ \chi ^{2} = \sum \frac{{(O_i - E_i)}^{2}}{E_i}$$
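For concreteness, here is the statistic computed directly from that formula; the die-roll counts below are made up purely for illustration:

```python
def chi_squared(observed, expected):
    # Sum of squared deviations, each scaled by its expected count E_i.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical example: 60 rolls of a fair die, so each face expects 10 counts.
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6
print(chi_squared(observed, expected))  # 1.0 for these counts
```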

$\chi ^{2}$ is used to determine how well a particular model fits some observed data. The way I justify this formula is that we want a model that resembles the data very closely, so we need to check how different the model is from the observed data.

  1. We can check how different individual data points are from the expected values by $(O_i - E_i)$. Thus, $(O_i - E_i)$ is justified.

  2. To determine how well the model fits the data as a whole, we can sum $(O_i - E_i)$ over all data points. $(O_i - E_i)$ will need to be squared to remove negative terms from the summation; negative terms would lower $\chi ^{2}$ and give a misleading goodness of fit. Thus, $\sum {(O_i - E_i)}^{2}$ is justified.

My question is why do we normalize ${(O_i - E_i)}^{2}$ by $E_i$? This seems unnecessary to me.

Best Answer

The general idea is that we want the test statistic $\chi^{2}$ to be independent of the scale of the data (or more precisely, the spread) as the actual probability distribution is independent of the scale. That is, given a distribution with variance $\sigma^{2}$, we can always divide out by $\sigma^{2}$ to get a unit-variance distribution, and it is both valid and simpler to always just work with a unit-variance distribution (we just have one fewer parameter to think about).
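As a quick illustration of that rescaling idea (a sketch with an arbitrary spread): dividing draws from a distribution by its standard deviation leaves a unit-variance sample.

```python
import random
import statistics

random.seed(0)
sigma = 5.0  # spread of the original distribution, chosen arbitrarily
xs = [random.gauss(0.0, sigma) for _ in range(100_000)]

# Dividing out by sigma rescales the sample to (approximately) unit variance.
standardized = [x / sigma for x in xs]
print(round(statistics.pvariance(standardized), 3))  # close to 1
```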

More specifically to $\chi^{2}$, the reason that we divide by $E_{i}$ has to do with the underlying distribution. First, it is important to understand what $\chi^{2}$ random variables are. Generally, a $\chi^{2}$ variable with $n$ degrees of freedom is the sum of squares of $n$ independent unit-variance normal random variables. So when we define $$ \chi^{2} = \sum \frac{(O_{i} - E_{i})^{2}}{E_{i}} $$ we are using the fact that each $\frac{(O_{i} - E_{i})}{\sqrt{E_{i}}}$ is (approximately) normally distributed with unit variance. More precisely, we can treat $O_{i}$ as being normally distributed with mean $E_{i}$ and variance $E_{i}$. The value $\frac{(O_{i} - E_{i})}{\sqrt{E_{i}}}$ is the standardized residual.
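The "sum of squares of unit-variance normals" definition can be checked with a small simulation (the choices of $n$ and the sample size are arbitrary): the sample mean of such sums should be close to the degrees of freedom $n$, since each squared standard normal has mean 1.

```python
import random

random.seed(1)
n, trials = 3, 100_000

# Each sample: the sum of squares of n standard-normal draws,
# i.e. a chi-squared variable with n degrees of freedom.
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
           for _ in range(trials)]

mean = sum(samples) / trials
print(round(mean, 2))  # close to n = 3
```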

There is a twist that explains why $E_{i}$ is the variance: the value $O_{i}$ is actually Poisson distributed. This makes sense since we are working with counts and not something continuous. For the Poisson distribution, the variance is precisely the same as the mean $E_{i}$. The question then is why we can treat $O_{i}$ as normally distributed. The reason is that when the expected count $E_{i}$ is "large enough", the Poisson distribution is reasonably similar to the normal distribution (very roughly).
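The mean-equals-variance property of the Poisson distribution can also be verified by simulation. The sketch below draws Poisson samples via Knuth's multiplication algorithm (the mean of 25, standing in for an expected count $E_i$, is arbitrary):

```python
import math
import random
import statistics

def poisson_sample(lam, rng):
    # Knuth's algorithm: count uniform draws until their running
    # product falls below e^(-lam).
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(2)
lam = 25.0  # plays the role of the expected count E_i
draws = [poisson_sample(lam, rng) for _ in range(50_000)]

# Both the sample mean and the sample variance should be close to lam = 25.
print(round(statistics.mean(draws), 1), round(statistics.pvariance(draws), 1))
```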

On a final note, the variance of a $\chi^{2}$ variable is actually not one but depends on the degrees of freedom. Regardless, the distribution has only the one parameter (the degrees of freedom) and no separate variance parameter, so the same idea as using a unit-variance distribution applies.
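For what it's worth, that variance works out to $2k$ for $k$ degrees of freedom, which the same simulation idea confirms (the choice of $k$ here is arbitrary):

```python
import random
import statistics

random.seed(3)
k, trials = 4, 100_000

# Chi-squared samples with k degrees of freedom, built from the definition:
# sums of squares of k standard-normal draws.
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))
           for _ in range(trials)]

print(round(statistics.pvariance(samples), 2))  # close to 2k = 8
```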