# Hypothesis Testing – Extension of Pearson’s Chi-Squared Test to Multinomial Random Variables

chi-squared-test · goodness-of-fit · hypothesis-testing

# Background

Suppose we observe $$n$$ IID Bernoulli variables and our null hypothesis is that their common probability of success is $$p$$.
Denote by $$\mathbb{1}_{\{i\}}$$ the outcome of observation $$i$$.

Then by the central limit theorem

$$\frac{\frac{1}{\sqrt{n}}\sum_{i=1}^n (\mathbb{1}_{\{i\}} - p)}{\sqrt{p \cdot (1 - p)}} \rightarrow N(0, 1),$$
which can be used for hypothesis testing.

Suppose now that the null hypothesis is instead that each variable has its own probability of success, $$p_i$$ (the variables are still independent). Then a simple argument lets us apply Lyapunov's version of the CLT, and we can thus conclude

$$\frac{\sum_{i=1}^n (\mathbb{1}_{\{i\}} - p_i)}{\sqrt{\sum_i p_i \cdot (1 - p_i)}} \rightarrow N(0, 1),$$ which can then be used to test this composite hypothesis.
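As a concrete illustration, this z-statistic and its two-sided p-value can be computed in a few lines (a minimal sketch; `hetero_bernoulli_ztest` is an illustrative name, not a library function):

```python
import math

def hetero_bernoulli_ztest(outcomes, probs):
    """Two-sided z-test of H0: P(X_i = 1) = probs[i], trials independent.

    The normalization follows the Lyapunov-CLT statistic above:
    sum_i (x_i - p_i) / sqrt(sum_i p_i (1 - p_i)).
    """
    num = sum(x - p for x, p in zip(outcomes, probs))
    den = math.sqrt(sum(p * (1 - p) for p in probs))
    z = num / den
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With all $$p_i$$ equal this reduces to the homogeneous z-test of the previous display.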

# Question

If instead we have $$k$$ categories and our null hypothesis is that the probability of category $$i$$ is $$p_i$$, and we have observed $$n_i$$ occurrences of each outcome $$i$$, then we can use the Chi-Square Goodness of Fit test, which states that if the $$n_i$$ sum to $$n$$ then

$$\sum_{i=1}^k \frac{(n_i - n \cdot p_i)^2}{n \cdot p_i} \rightarrow \chi^2(k - 1)$$.
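For reference, the standard statistic is a one-liner (a sketch; `chisq_gof_stat` is an illustrative name, and the same number is what e.g. `scipy.stats.chisquare` reports):

```python
def chisq_gof_stat(counts, probs):
    """Pearson chi-squared goodness-of-fit statistic for one multinomial sample.

    counts[i] = observed occurrences of category i, probs[i] = its H0 probability.
    Under H0 the statistic is asymptotically chi^2 with k - 1 degrees of freedom.
    """
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))
```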

Analogously to above, instead I want to form a null hypothesis where I conduct $$n$$ experiments, but for each of them the $$k$$ categories have separate probabilities $$\{(p_1^1, p_2^1, \ldots p_k^1), (p_1^2, p_2^2, \ldots p_k^2), \ldots, (p_1^n, p_2^n, \ldots p_k^n) \}.$$

Is there a generalization of the Chi-Square Goodness of Fit test applicable to test this kind of hypothesis?
Looking briefly at the proof of the standard case gives me the feeling that it should be possible, but surely I can't be the first to ask this?

# Answer

I could not find this stated as a theorem anywhere, so I had to formulate and prove one myself.

I have $$n$$ independent variables $$\{X_{i}\}_{i=1}^n$$, each attaining a value in $$\{1, \ldots, k\}$$, and my null hypothesis is that $$P(X_i = j) = p_i^j$$.

Then for each $$j \in \{ 1, \ldots, k \}$$, Lyapunov's version of the CLT yields

$$Z_j := \frac{\sum_{i=1}^n (\mathbb{1}_{\{X_i = j\}} - p_i^j)}{\sqrt{\sum_{i=1}^n p_i^j \cdot (1 - p_i^j)}} \rightarrow N(0, 1).$$

So we have $$k$$ standard normal variables and to proceed we need to figure out their covariance matrix, $$\text{cov}(\overline{Z})$$.

Straightforward computations yield

$$\text{cov}(Z_r, Z_s) = \left\{ \begin{array}{ll} 1 && \text{if } r = s\\ \frac{-\sum_{i = 1}^{n} p_i^r \cdot p_i^s}{\sqrt{(\sum_{i = 1}^{n} p_i^r \cdot (1 - p_i^r)) \cdot (\sum_{i = 1}^{n} p_i^s \cdot (1 - p_i^s))}} && \text{otherwise} \end{array} \right.$$
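Under an assumed matrix of null probabilities, this covariance matrix can be computed directly from the closed form above (a sketch; `z_covariance` is an illustrative name):

```python
import numpy as np

def z_covariance(P):
    """Covariance matrix of (Z_1, ..., Z_k) under H0.

    P is an (n, k) array with P[i, j] = p_i^j; each row sums to 1.
    Diagonal entries are 1; entry (r, s) for r != s is
    -sum_i p_i^r p_i^s / sqrt(sum_i p_i^r (1 - p_i^r) * sum_i p_i^s (1 - p_i^s)).
    """
    v = (P * (1 - P)).sum(axis=0)             # per-category variance sums
    C = -(P.T @ P) / np.sqrt(np.outer(v, v))  # off-diagonal formula
    np.fill_diagonal(C, 1.0)                  # cov(Z_r, Z_r) = 1
    return C
```

For $$k = 2$$ the matrix is singular with off-diagonal entry $$-1$$, since $$Z_1 = -Z_2$$; this is the degeneracy that the rank reduction below handles.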

By construction $$\text{cov}(\overline{Z})$$ is real and symmetric, hence by the spectral theorem it has orthonormal eigenvectors forming a basis of $$\mathbb{R}^k$$. Letting $$V$$ be the matrix whose columns are such eigenvectors and $$D$$ the diagonal matrix of corresponding eigenvalues, we have

$$\text{cov}(\overline{Z}) = V D V^{\top}$$.

Furthermore it's easily seen that

$$\text{cov}(V^{\top} \overline{Z}) = V^{\top}\text{cov}(\overline{Z}) V = D$$.

Denote the elements of $$V^{\top} \overline{Z}$$ by $$(\xi_i)_i$$, the elements of $$D$$ by $${(d_{i,j})}_{i, j}$$ (of course $$d_{i,j} = 0$$ for $$i \neq j$$) and define the vector

$$\overline{W} = (\frac{\xi_i}{\sqrt{d_{i,i}}} )_{\{i: d_{i,i} \neq 0\}}$$.

It's now clear that the elements of $$\overline{W}$$ are independent standard normals, and that $$\overline{W}$$ has length $$L = \text{rank}(\text{cov}(\overline{Z}))$$.

Hence, under the null hypothesis, $$\| \overline{W} \|^2$$ is asymptotically $$\chi^2(L)$$-distributed, and this squared norm is the statistic we compute when conducting the test.
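The whole construction can be sketched end to end (illustrative names throughout; `hetero_chisq_test` is not a library function, and the eigenvalue cutoff `tol` is an assumed numerical tolerance):

```python
import numpy as np

def hetero_chisq_test(x, P, tol=1e-10):
    """Generalized chi-squared goodness-of-fit test derived above.

    x[i] in {0, ..., k-1} is the observed category of trial i and
    P[i, j] = p_i^j under H0.  Returns ||W||^2 and the degrees of
    freedom L = rank(cov(Z)); under H0 the statistic is
    asymptotically chi^2(L).
    """
    n, k = P.shape
    ind = np.zeros((n, k))
    ind[np.arange(n), x] = 1.0                  # indicators 1_{X_i = j}
    v = (P * (1 - P)).sum(axis=0)
    Z = (ind - P).sum(axis=0) / np.sqrt(v)      # the k statistics Z_j
    C = -(P.T @ P) / np.sqrt(np.outer(v, v))
    np.fill_diagonal(C, 1.0)                    # cov(Z) under H0
    d, V = np.linalg.eigh(C)                    # spectral decomposition V D V^T
    xi = V.T @ Z                                # decorrelated components
    keep = d > tol                              # drop (numerically) zero eigenvalues
    W = xi[keep] / np.sqrt(d[keep])             # independent standard normals
    return float(W @ W), int(keep.sum())
```

For $$k = 2$$ the covariance matrix has rank 1 and the statistic reduces to the square of the heterogeneous Bernoulli z-statistic from the background section, as one would hope.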