Solved – How does Pearson's Chi-Squared Statistic approximate a Chi-Squared Distribution

Tags: asymptotics, chi-squared-distribution, chi-squared-test

So if Pearson's Chi-Squared Statistic is given for a $1 \times n$ table, then its form is:

$$\sum_{i=1}^n\frac{(O_i - E_i)^2}{E_i}$$

This approximates $\chi_{n-1}^2$, the Chi-Squared Distribution with $n-1$ degrees of freedom, as the sample size $N$ gets larger.

What I don't understand is how this asymptotic approximation works. I feel like the $E_i$'s in the denominators should be replaced with $\frac{s_i^2}{n_i}$, since that would give you $\chi_n^2 = \sum_{i=1}^n Z_i^2$ for $Z_i\sim N(0,1)$. But of course this has $n$ degrees of freedom, not $n-1$, so clearly something else is going on.

Best Answer

I'm going to motivate this intuitively, and indicate how it comes about for the special case of two groups, assuming you're happy to accept the normal approximation to the binomial.

Hopefully that will be enough for you to get a good sense of why it works the way it does.

You're talking about the chi-square goodness of fit test. Let's say there are $k$ groups (you have it as $n$, but there's a reason I tend to prefer to call it $k$).

In the model being applied for this situation, the counts $O_i$, $i=1,2,...,k$ are multinomial.

Let $N=\sum_{i=1}^k O_i$. The counts are conditioned on the sum $N$ (except in some fairly rare situations), and there is a prespecified set of probabilities for each category, $p_i$, $i=1, 2, \ldots,k$, which sum to $1$.

Just as with the binomial, there's an asymptotic normal approximation for multinomials -- indeed, if you consider only the count in a given cell ("in this category" or not), it would then be binomial. Just as with the binomial, the variances of the counts (as well as their covariances in the multinomial) are functions of $N$ and the $p$'s; you don't estimate a variance separately.

That is, if the expected counts are sufficiently large, the vector of counts is approximately multivariate normal with means $E_i=Np_i$. However, because the counts are conditioned on $N$, the distribution is degenerate (it lives in a hyperplane of dimension $k-1$, since specifying $k-1$ of the counts fixes the remaining one). The variance-covariance matrix has diagonal entries $Np_i(1-p_i)$ and off-diagonal entries $-Np_ip_j$, and it is of rank $k-1$ because of the degeneracy.
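To make that covariance structure concrete, here's a minimal sketch (assuming NumPy; the values of $N$ and the $p_i$ are purely illustrative) that builds the multinomial variance-covariance matrix $N(\operatorname{diag}(p) - pp^\top)$ and confirms its rank is $k-1$:

```python
import numpy as np

N = 100
p = np.array([0.2, 0.3, 0.5])   # prespecified cell probabilities, summing to 1
k = len(p)

# Variance-covariance matrix of the counts: N * (diag(p) - p p^T),
# i.e. diagonal N*p_i*(1-p_i), off-diagonal -N*p_i*p_j.
Sigma = N * (np.diag(p) - np.outer(p, p))

print(np.diag(Sigma))                # N*p_i*(1-p_i) on the diagonal
print(np.linalg.matrix_rank(Sigma))  # k - 1 = 2, reflecting the degeneracy
```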

As a result, for an individual cell $\text{Var}(O_i)=Np_i(1-p_i)$, and you could write $z_i = \frac{O_i-E_i}{\sqrt{E_i(1-p_i)}}$. However, the terms are dependent (negatively correlated), so if you sum the squares of those $z_i$ it won't have a $\chi^2_k$ distribution (as it would if they were independent standardized variables). Instead we could potentially construct a set of $k-1$ variables from the original $k$ which are independent and still approximately normal (asymptotically normal). If we summed their (standardized) squares, we'd get a $\chi^2_{k-1}$. There are ways to construct such a set of $k-1$ variables explicitly, but fortunately there's a very neat shortcut that avoids a substantial amount of effort, and yields the same result (the same value of the statistic) as if we had gone to the trouble.
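If you want to see that dependence at work, here's a small simulation sketch (again assuming NumPy; the sample size and probabilities are arbitrary choices). It compares the "naive" sum of the $z_i^2$ above with Pearson's statistic: the naive sum has mean about $k$ but, because of the negative correlations, not the variance $2k$ of a $\chi^2_k$, while Pearson's sum has mean and variance close to $k-1$ and $2(k-1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
p = np.array([0.2, 0.3, 0.5])
k = len(p)
E = N * p

O = rng.multinomial(N, p, size=100_000)        # 100,000 simulated tables

naive   = ((O - E) ** 2 / (E * (1 - p))).sum(axis=1)
pearson = ((O - E) ** 2 / E).sum(axis=1)

print(naive.mean(), naive.var())      # mean ~ k = 3, but variance is not 2k
print(pearson.mean(), pearson.var())  # ~ (k-1, 2(k-1)) = (2, 4)
```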

Consider, for simplicity, a goodness of fit with two categories (which is now binomial). The probability of being in the first cell is $p_1=p$, and in the second cell is $p_2=1-p$. There are $X = O_1$ observations in the first cell, and $N-X=O_2$ in the second cell.

The observed first-cell count, $X$, is asymptotically $\text{N}(Np,\,Np(1-p))$. We can standardize it as $z=\frac{X-Np}{\sqrt{Np(1-p)}}$. Then $z^2 = \frac{(X-Np)^2}{Np(1-p)}$ is approximately $\chi^2_1$ (asymptotically $\chi^2_1$).
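As a quick sanity check on that normal approximation (a simulation sketch, with arbitrary illustrative values of $N$ and $p$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 400, 0.25
X = rng.binomial(N, p, size=100_000)

# Standardized binomial counts should be close to N(0, 1).
z = (X - N * p) / np.sqrt(N * p * (1 - p))
print(z.mean(), z.std())     # close to 0 and 1
print((z ** 2).mean())       # close to 1, the mean of a chi^2_1
```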

Notice that

$\sum_{i=1}^2 \frac{(O_i-E_i)^2}{E_i} = \frac{[X-Np]^2}{Np}+ \frac{[(N-X)-(N-Np)]^2}{N(1-p)}= \frac{[X-Np]^2}{Np}+ \frac{[X-Np]^2}{N(1-p)}=(X-Np)^2[\frac{1}{Np}+ \frac{1}{N(1-p)}]$.

But

$\frac{1}{Np}+ \frac{1}{N(1-p)} = \frac{Np+N(1-p)}{Np\cdot N(1-p)} = \frac{N}{N^2p(1-p)} = \frac{1}{Np(1-p)}$.

So $\sum_{i=1}^2 \frac{(O_i-E_i)^2}{E_i} =\frac{(X-Np)^2}{Np(1-p)}$, which is the $z^2$ we started with, and which asymptotically will be a $\chi^2_1$ random variable. The dependence between the two cells is such that by dividing by $E_i$ instead of $E_i(1-p_i)$ we exactly compensate for the dependence between the two, and recover the square of a single approximately normal random variable.
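Note that the identity above is exact algebra, not an approximation; only the $\chi^2_1$ claim is asymptotic. A quick check with arbitrary (hypothetical) numbers:

```python
# The two-cell Pearson sum equals (X - Np)^2 / (Np(1-p)) for ANY X, N, p.
N, p, X = 50, 0.3, 19        # hypothetical values

E1, E2 = N * p, N * (1 - p)
pearson = (X - E1) ** 2 / E1 + ((N - X) - E2) ** 2 / E2
z_squared = (X - N * p) ** 2 / (N * p * (1 - p))

print(pearson, z_squared)    # identical up to floating-point rounding
```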

The same kind of dependence is taken care of by the same approach when there are more than two categories -- by summing the $\frac{(O_i-E_i)^2}{E_i}$ instead of $\frac{(O_i-E_i)^2}{E_i(1-p_i)}$ over all $k$ terms, you exactly compensate for the effect of the dependence, and obtain a sum equivalent to a sum of squares of $k-1$ independent standard normals.
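To see the $\chi^2_{k-1}$ approximation holding for $k>2$, here's one more simulation sketch (assuming NumPy and SciPy; $k=5$ equiprobable cells is an arbitrary choice) comparing empirical quantiles of the statistic with $\chi^2_4$ quantiles:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N = 500
p = np.full(5, 0.2)                    # k = 5 equiprobable categories
E = N * p

O = rng.multinomial(N, p, size=50_000)
pearson = ((O - E) ** 2 / E).sum(axis=1)

qs = [0.50, 0.90, 0.95, 0.99]
print(np.quantile(pearson, qs))           # empirical quantiles
print(stats.chi2.ppf(qs, df=len(p) - 1))  # chi^2_4 quantiles: close match
```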

There are a variety of ways to show that the statistic has a distribution that is asymptotically $\chi^2_{k-1}$ for general $k$ (it's covered in some undergraduate statistics courses, and can be found in a number of undergraduate-level texts), but I don't want to lead you too far beyond the level your question suggests. Indeed, derivations are easy to find in notes on the internet; for example, there are two different derivations in the space of about two pages here.
