Why does the Chi-squared test statistic follow the Chi-squared distribution?

chi-squared, normal distribution

I know the Chi-squared test statistic is defined as:

$$\chi^2=\sum_{i=1}^n\frac{({O_i-E_i})^2}{E_i}$$

where $O_i$ is the observed count in category $i$, and $E_i$ is the expected count.

I also know that the $\chi^2$ distribution is essentially defined as the sum of squared Gaussian random variables.

Does that mean that in order to use a Chi-squared test, one of your assumptions must be that $\sqrt{\frac{({O_i-E_i})^2}{E_i}}$ follows a Gaussian distribution? If so, is there an explanation/proof as to why this is a reasonable assumption?

Note: I didn't find any of the answers here super helpful: Why the chi-squared statistic follows chi-squared distribution?

Best Answer

You have $O_i \sim \operatorname{Binomial}(m, E_i/m),$ where $m$ is the sample size.

So $\dfrac{O_i - E_i}{\sqrt{E_i(1 - (E_i/m))}} \approx \dfrac{O_i - E_i}{\sqrt{E_i}}$ is approximately normal if $m$ is large.
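A quick way to see the normal approximation at work is to simulate a single cell count. The sketch below is only an illustration: the values $m=1000$ and $E_i/m = 1/6$, the number of repetitions, the seed, and the helper name `standardized_counts` are assumptions chosen for the example, not part of the argument above. It standardizes simulated binomial counts with the exact binomial standard deviation and reports how many fall within $\pm 1.96$, which should be close to $0.95$ if the normal approximation is reasonable.

```python
import math
import random


def standardized_counts(m, p, reps, seed=0):
    """Simulate O ~ Binomial(m, p) and return (O - E) / sqrt(E * (1 - E/m))."""
    rng = random.Random(seed)
    e = m * p                          # expected count E
    sd = math.sqrt(e * (1 - e / m))    # exact binomial standard deviation
    out = []
    for _ in range(reps):
        # One binomial draw: count the successes in m Bernoulli(p) trials.
        o = sum(rng.random() < p for _ in range(m))
        out.append((o - e) / sd)
    return out


z = standardized_counts(m=1000, p=1 / 6, reps=2000)

# Share of standardized counts inside the central 95% band of N(0, 1).
coverage = sum(abs(v) <= 1.96 for v in z) / len(z)
print(f"share within +/-1.96: {coverage:.3f}")
```

Rerunning with a much smaller `m` (say 10) should make the share drift away from $0.95$, which is the practical meaning of "large sample size" here.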

However, notice that the terms $\dfrac{O_i-E_i}{\sqrt{E_i}},\, i=1,\ldots,n$ are neither independent nor uncorrelated. They are negatively correlated because they are subject to the constraint $$ \sum_{i=1}^n O_i = m. $$ For example, if you throw a die $1000$ times, then the sum of the numbers of times the different outcomes occur must be $1000;$ in this case we have $n=6$ and $m=1000.$ The matrix of covariances of these terms is a $6\times6$ matrix of rank $5.$ When diagonalized, five of the diagonal entries are equal to $1$ and the sixth is $0.$ That is why the chi-square distribution has $5$ degrees of freedom. It is the distribution of the sum of the squares of $5=n-1$ independent $\operatorname N(0,1)$ random variables.
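The die example can be checked numerically. The sketch below assumes a fair die, so every cell probability is $p_i = E_i/m = 1/6$, and uses the standard multinomial covariances, under which $Z_i = (O_i - E_i)/\sqrt{E_i}$ has $\operatorname{Var}(Z_i) = 1 - p_i$ and $\operatorname{Cov}(Z_i, Z_j) = -\sqrt{p_i p_j}$. A symmetric matrix that equals its own square has only the eigenvalues $0$ and $1$, and its rank equals its trace, so the code checks those two properties instead of diagonalizing. It then simulates the statistic and compares its average with $5$, the mean of a chi-squared distribution with $5$ degrees of freedom. The repetition count, the seed, and the names `cov`, `chi2_stat` and `mean_stat` are illustrative choices.

```python
import math
import random
from collections import Counter

n, m = 6, 1000                     # six faces, 1000 throws
p = [1 / n] * n                    # fair-die cell probabilities (assumption)
expected = [m * pi for pi in p]    # expected counts E_i

# Covariance matrix of Z_i = (O_i - E_i) / sqrt(E_i) under the multinomial model.
cov = [[(1.0 if i == j else 0.0) - math.sqrt(p[i] * p[j]) for j in range(n)]
       for i in range(n)]

# Idempotence check: cov @ cov should reproduce cov up to rounding error.
cov_sq = [[sum(cov[i][k] * cov[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
max_gap = max(abs(cov_sq[i][j] - cov[i][j]) for i in range(n) for j in range(n))

# For a symmetric idempotent matrix, the trace counts the eigenvalues equal to 1.
trace = sum(cov[i][i] for i in range(n))


def chi2_stat(rng):
    """Throw the die m times and return sum_i (O_i - E_i)^2 / E_i."""
    counts = Counter(rng.choices(range(n), weights=p, k=m))
    return sum((counts[i] - expected[i]) ** 2 / expected[i] for i in range(n))


rng = random.Random(0)
stats = [chi2_stat(rng) for _ in range(1000)]
mean_stat = sum(stats) / len(stats)

print(f"max |cov^2 - cov| = {max_gap:.2e}")
print(f"trace = {trace:.6f}")
print(f"simulated mean of the statistic = {mean_stat:.3f}")
```

The trace of $5$ is the algebraic counterpart of "five diagonal entries equal to $1$ and the sixth equal to $0$", and the simulated mean landing near $5$ rather than $6$ reflects the degree of freedom lost to the constraint.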
