# Background

Suppose we observe $n$ IID Bernoulli variables and our null hypothesis is that their common probability is $p$.

Denote by $\mathbb{1}_{\{i\}}$ the outcome of observation $i$.

Then by the central limit theorem

$\frac{\frac{1}{\sqrt{n}}\sum_{i=1}^n (\mathbb{1}_{\{i\}} - p)}{\sqrt{p \cdot (1 - p)}} \rightarrow N(0, 1),$

which can be used for hypothesis testing.
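A minimal sketch of that test in Python (the function name `bernoulli_z_test` and the example data are my own, not from the question):

```python
import numpy as np
from scipy import stats

def bernoulli_z_test(x, p):
    """Two-sided z-test of H0: the observations are IID Bernoulli(p)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # (sum_i x_i - n p) / (sqrt(n) * sqrt(p (1 - p))), the CLT statistic above
    z = (x.sum() - n * p) / np.sqrt(n * p * (1 - p))
    return z, 2 * stats.norm.sf(abs(z))

# 30 successes in 100 trials, tested against p = 0.3
z, p_value = bernoulli_z_test([1] * 30 + [0] * 70, 0.3)
```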

Suppose now that the null hypothesis is instead that each variable has an individual probability of success, $p_i$ (they are still independent). Then a simple argument allows us to use Lyapunov's version of the CLT, and we can thus conclude

$\frac{\sum_{i=1}^n (\mathbb{1}_{\{i\}} - p_i)}{\sqrt{\sum_i p_i \cdot (1 - p_i)}} \rightarrow N(0, 1),$

which can then be used to test this composite hypothesis.
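The heterogeneous version differs only in the centering and the variance term; a sketch (again, `hetero_bernoulli_z_test` is my own name for it):

```python
import numpy as np
from scipy import stats

def hetero_bernoulli_z_test(x, p):
    """Two-sided z-test of H0: observation i is Bernoulli(p[i]), all independent."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    # sum_i (x_i - p_i) / sqrt(sum_i p_i (1 - p_i)), the Lyapunov-CLT statistic
    z = (x - p).sum() / np.sqrt((p * (1 - p)).sum())
    return z, 2 * stats.norm.sf(abs(z))

z, p_value = hetero_bernoulli_z_test([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```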

# Question

If instead we have $k$ categories, our null hypothesis is that the probability of category $i$ is $p_i$, and we have observed $n_i$ occurrences of each outcome $i$, then we can use the Chi-Square Goodness of Fit test, which states that if the $n_i$ sum to $n$ then

$\sum_{i=1}^k \frac{(n_i - n \cdot p_i)^2}{n \cdot p_i} \rightarrow \chi^2(k - 1)$.
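For reference, this classical test is available directly in scipy; a minimal sketch (the counts and probabilities here are made up for illustration):

```python
import numpy as np
from scipy import stats

observed = np.array([18, 25, 57])   # the counts n_i, summing to n = 100
p = np.array([0.2, 0.3, 0.5])       # hypothesized category probabilities
expected = observed.sum() * p       # n * p_i = [20, 30, 50]

# stat is sum_i (n_i - n p_i)^2 / (n p_i), compared against chi^2(k - 1)
stat, p_value = stats.chisquare(observed, f_exp=expected)
```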

Analogously to the above, I instead want to form a null hypothesis where I conduct $n$ experiments, but for each of them the $k$ categories have separate probabilities $\{(p_1^1, p_2^1, \ldots, p_k^1), (p_1^2, p_2^2, \ldots, p_k^2), \ldots, (p_1^n, p_2^n, \ldots, p_k^n)\}.$

Is there a generalization of the Chi-Square Goodness of Fit test applicable to test this kind of hypothesis?

Looking briefly at the proof of the standard case gives me the feeling that it should be possible, but surely I can't be the first one to ask this?

## Best Answer

I could not find this stated as a theorem anywhere, so I had to formulate and prove one myself.

I have $n$ variables $\{X_{i}\}_{i=1}^n$, each attaining a value in $\{1, \ldots, k\}$, and my null hypothesis is that $P(X_i = j) = p_i^j$.

Then for each $j \in \{ 1, \ldots k \}$ Lyapunov's version of CLT yields

$Z_j := \frac{\sum_{i=1}^n (\mathbb{1}_{\{X_i = j\}} - p_i^j)}{\sqrt{\sum_{i=1}^n p_i^j \cdot (1 - p_i^j)}} \rightarrow N(0, 1).$

So we have $k$ standard normal variables and to proceed we need to figure out their covariance matrix, $\text{cov}(\overline{Z})$.

Straightforward computations yield

$\text{cov}(Z_r, Z_s) = \begin{cases} 1 & \text{if } r = s, \\ \dfrac{-\sum_{i = 1}^{n} p_i^r \cdot p_i^s}{\sqrt{\left(\sum_{i = 1}^{n} p_i^r \cdot (1 - p_i^r)\right) \cdot \left(\sum_{i = 1}^{n} p_i^s \cdot (1 - p_i^s)\right)}} & \text{otherwise.} \end{cases}$
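This matrix can be assembled directly from the formula above; a numpy sketch (the name `z_covariance` is my own):

```python
import numpy as np

def z_covariance(P):
    """Covariance matrix of (Z_1, ..., Z_k) under H0.

    P is an (n, k) array with P[i, j] = p_i^j, rows summing to 1.
    """
    P = np.asarray(P, dtype=float)
    sd = np.sqrt((P * (1 - P)).sum(axis=0))   # sqrt(sum_i p_i^j (1 - p_i^j))
    # entry (r, s): -sum_i p_i^r p_i^s, scaled by the standard deviations
    cov = -(P[:, :, None] * P[:, None, :]).sum(axis=0) / np.outer(sd, sd)
    np.fill_diagonal(cov, 1.0)                # cov(Z_r, Z_r) = 1
    return cov

C = z_covariance([[0.5, 0.5]])
```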

By construction $\text{cov}(\overline{Z})$ is symmetric and real hence by the spectral theorem there are orthonormal eigenvectors of it forming a basis of $\mathbb{R}^k$. Letting $V$ be the matrix of such eigenvectors and $D$ the diagonal matrix of corresponding eigenvalues we have

$\text{cov}(\overline{Z}) = V D V^{\top}$.

Furthermore it's easily seen that

$\text{cov}(V^{\top} \overline{Z}) = V^{\top}\text{cov}(\overline{Z}) V = D$.

Denote the elements of $V^{\top} \overline{Z}$ by $(\xi_i)_i$, the elements of $D$ by ${(d_{i,j})}_{i, j}$ (of course $d_{i,j} = 0$ for $i \neq j$) and define the vector

$\overline{W} = (\frac{\xi_i}{\sqrt{d_{i,i}}} )_{\{i: d_{i,i} \neq 0\}}$.

It's now clear that the elements of $\overline{W}$ are independent standard normals, and that $\overline{W}$ has length $L = \text{rank}(\text{cov}(\overline{Z}))$.

Hence, under the null hypothesis, $\| \overline{W} \|^2 \sim \chi^2(L)$, and this squared norm is the test statistic we compute when conducting the test.
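Putting the whole construction together, here is a sketch of the test in Python (the function name `hetero_chi2_gof` and the eigenvalue cutoff `tol` are my own choices, not part of the derivation):

```python
import numpy as np
from scipy import stats

def hetero_chi2_gof(X, P, tol=1e-10):
    """Goodness-of-fit test of H0: P(X[i] = j) = P[i, j], observations independent.

    X is a length-n array of categories in {0, ..., k-1}; P is an (n, k)
    matrix whose rows sum to 1.  Returns (||W||^2, L, p-value).
    """
    X = np.asarray(X)
    P = np.asarray(P, dtype=float)
    n, k = P.shape
    ind = np.zeros((n, k))
    ind[np.arange(n), X] = 1.0                      # the indicators 1_{X_i = j}
    sd = np.sqrt((P * (1 - P)).sum(axis=0))         # sqrt(sum_i p_i^j (1 - p_i^j))
    Z = (ind - P).sum(axis=0) / sd                  # the vector (Z_1, ..., Z_k)
    cov = -(P[:, :, None] * P[:, None, :]).sum(axis=0) / np.outer(sd, sd)
    np.fill_diagonal(cov, 1.0)                      # cov(Z_bar)
    d, V = np.linalg.eigh(cov)                      # cov(Z_bar) = V D V^T
    keep = d > tol                                  # drop the zero eigenvalues
    W = (V.T @ Z)[keep] / np.sqrt(d[keep])          # the vector W_bar
    L = int(keep.sum())                             # rank of cov(Z_bar)
    stat = float((W ** 2).sum())
    return stat, L, stats.chi2.sf(stat, df=L)

# Sanity check: with all rows of P equal (the IID case), the statistic
# agrees with the classical Pearson statistic sum_j (n_j - n p_j)^2 / (n p_j).
X = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]                 # counts (3, 3, 4), n = 10
P = np.tile([0.2, 0.3, 0.5], (10, 1))
stat, L, p_value = hetero_chi2_gof(X, P)
```

The IID comparison is a useful check of the construction: for that example the Pearson statistic is $\frac{(3-2)^2}{2} + \frac{(3-3)^2}{3} + \frac{(4-5)^2}{5} = 0.7$ on $k - 1 = 2$ degrees of freedom, and the eigendecomposition route reproduces it.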