I am trying to understand the logic behind the chi-squared test.
The chi-squared statistic is $\chi ^2 = \sum \frac{(obs-exp)^2}{exp}$. $\chi ^2$ is then compared to a chi-squared distribution to find a p-value, in order to reject or not reject the null hypothesis. $H_0$: the observations come from the distribution we used to create our expected values. For example, we could test whether the probability of obtaining heads is given by $p$, as we expect. So we flip a coin 100 times and find $n_H$ heads and $100-n_H$ tails. We want to compare our findings to what is expected ($100 \cdot p$ heads and $100 \cdot (1-p)$ tails). We could just as well use a binomial distribution, but that is not the point of the question… The question is:
Can you please explain why, under the null hypothesis, $\sum \frac{(obs-exp)^2}{exp}$ follows a chi-squared distribution?
All I know about the chi-squared distribution is that the chi-squared distribution with $k$ degrees of freedom is the distribution of the sum of $k$ squared independent standard normal random variables.
Best Answer
Nevertheless, the binomial distribution is our starting point even for your actual question. I'll cover it somewhat informally.
Let's consider the binomial case more generally:
$Y\sim \text{Bin}(n,p)$
Assume $n$ and $p$ are such that $Y$ is well approximated by a normal with the same mean and variance (some typical requirements are that $\min(np,n(1-p))$ is not small, or that $np(1-p)$ is not small).
Then $(Y-E(Y))/\sqrt{\text{Var}(Y)}$ is approximately standard normal, so its square $(Y-E(Y))^2/\text{Var}(Y)$ will be approximately $\sim\chi^2_1$. Here $Y$ is the number of successes.
We have $E(Y) = np$ and $\text{Var}(Y)=np(1-p)$.
(In the testing case, $n$ is known and $p$ is specified under $H_0$. We don't do any estimation.)
So if $H_0$ is true, $(Y-np)^2/[np(1-p)]$ will be approximately $\sim\chi^2_1$.
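As a rough numerical illustration of this approximation (not a proof), the sketch below simulates the statistic under $H_0$ and compares its empirical distribution with the exact $\chi^2_1$ CDF. The values $n=100$, $p=0.3$, the number of replications and the function names are assumptions chosen for the example; only the Python standard library is used.

```python
import math
import random

def chi2_1_cdf(x):
    """Exact CDF of a chi-squared variable with 1 degree of freedom."""
    # If Z ~ N(0, 1), then P(Z^2 <= x) = P(|Z| <= sqrt(x)) = erf(sqrt(x / 2)).
    return math.erf(math.sqrt(x / 2.0))

def simulate_statistic(n, p, reps, seed=0):
    """Draw `reps` values of (Y - np)^2 / (np(1-p)) with Y ~ Bin(n, p)."""
    rng = random.Random(seed)
    mean, var = n * p, n * p * (1 - p)
    stats = []
    for _ in range(reps):
        # One binomial draw: count successes among n Bernoulli(p) trials.
        y = sum(rng.random() < p for _ in range(n))
        stats.append((y - mean) ** 2 / var)
    return stats

# Illustrative values only (assumed, not from the question).
stats = simulate_statistic(n=100, p=0.3, reps=20000)
for x in (1.0, 2.0, 4.0):
    empirical = sum(s <= x for s in stats) / len(stats)
    print(f"x={x}: empirical={empirical:.3f}  chi2_1 CDF={chi2_1_cdf(x):.3f}")
```

Because $Y$ is discrete, the empirical values only track the continuous $\chi^2_1$ CDF approximately; the agreement tends to improve as $np(1-p)$ grows.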
Note that $(Y-np)^2 = [(n-Y)-n(1-p)]^2$. Also note that $\frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}$.
Hence
$$\begin{aligned}
\frac{(Y-np)^2}{np(1-p)} &= \frac{(Y-np)^2}{np}+\frac{(Y-np)^2}{n(1-p)}\\
&= \frac{(Y-np)^2}{np}+\frac{[(n-Y)-n(1-p)]^2}{n(1-p)} \\
&= \frac{(O_S-E_S)^2}{E_S}+\frac{(O_F-E_F)^2}{E_F}
\end{aligned}$$
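This is just the chi-square statistic for the binomial case, with observed counts $O_S = Y$, $O_F = n-Y$ and expected counts $E_S = np$, $E_F = n(1-p)$ for successes and failures.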
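The algebra can also be checked numerically. The following is a minimal sketch, with hypothetical function names and arbitrarily chosen values of $n$, $p$ and $Y$, that evaluates both forms of the statistic and confirms they agree up to floating-point error.

```python
import math

def one_df_form(y, n, p):
    """(Y - np)^2 / (np(1-p)): the squared standardized success count."""
    return (y - n * p) ** 2 / (n * p * (1 - p))

def pearson_form(y, n, p):
    """Sum of (observed - expected)^2 / expected over successes and failures."""
    observed = (y, n - y)
    expected = (n * p, n * (1 - p))
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Arbitrary illustrative values (assumed for this example).
n, p = 100, 0.3
for y in (20, 30, 37, 45):
    a, b = one_df_form(y, n, p), pearson_form(y, n, p)
    # The two expressions should match up to rounding error.
    assert math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    print(y, a, b)
```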
So in that case the chi-square statistic should have the distribution of the square of an (approximately) standard-normal random variable.
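To tie this back to the coin example in the question, here is a small sketch that computes the statistic and an approximate p-value from the $\chi^2_1$ upper tail. The outcome of 60 heads in 100 flips with $p=0.5$ is a made-up illustration, and `chi2_coin_test` is a hypothetical helper name; the tail probability uses the fact that $P(Z^2 > x) = \operatorname{erfc}(\sqrt{x/2})$ for standard normal $Z$.

```python
import math

def chi2_coin_test(n_heads, n_flips, p):
    """Pearson chi-square statistic and approximate p-value for H0: P(heads) = p."""
    observed = (n_heads, n_flips - n_heads)
    expected = (n_flips * p, n_flips * (1 - p))
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of chi2_1: P(Z^2 > stat) = erfc(sqrt(stat / 2)) for Z ~ N(0, 1).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical data: 60 heads in 100 flips, testing p = 0.5.
stat, p_value = chi2_coin_test(n_heads=60, n_flips=100, p=0.5)
print(f"chi-square = {stat:.3f}, approximate p-value = {p_value:.4f}")
```

For these assumed counts the statistic is $10^2/50 + 10^2/50 = 4$, and the approximate p-value is about 0.0455, the same as the two-sided normal tail probability beyond $|Z| = 2$.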