[Math] Chi-square goodness of fit test proof

chi-squared statistics

I understand the classical $\chi^2$ "goodness of fit" test used in statistics: we compute $\sum_{i=1}^n \frac{(O_i - E_i)^2}{E_i}$ and compare this quantity to a value found in a table of the $\chi^2$ distribution (for a given risk, for example $\alpha = 5\%$) to decide whether or not to reject the hypothesis that the sample was drawn from a given distribution.

But I haven't yet found a precise proof online showing that this is not only a good "recipe" but a result that follows rigorously from probability theory (I know such a proof exists, but I haven't found one yet).

Do you know a good detailed proof?

Best Answer

I think I finally found one in this document, page 2.

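Theorem (Pearson): The random variable

$$\sum_{j=1}^r \frac{(\nu_j-n p_j)^2}{n p_j}$$ converges in distribution, as $n \to \infty$, to the $\chi_{r-1}^2$ distribution with $r - 1$ degrees of freedom. Here $n$ is the sample size, $r$ is the number of categories, $\nu_j$ is the observed count in category $j$, and $p_j$ is the probability of category $j$ under the hypothesis.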
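As a sanity check on the statement, the simplest case $r = 2$ can be worked out by hand. This is only a sketch of that special case, assuming $\nu_1$ is binomial with parameters $n$ and $p_1$; it is not the general argument from the linked document. Since $\nu_2 = n - \nu_1$ and $p_2 = 1 - p_1$, we have $\nu_2 - n p_2 = -(\nu_1 - n p_1)$, so

$$\begin{aligned}
\frac{(\nu_1 - n p_1)^2}{n p_1} + \frac{(\nu_2 - n p_2)^2}{n p_2}
&= (\nu_1 - n p_1)^2 \left( \frac{1}{n p_1} + \frac{1}{n p_2} \right) \\
&= (\nu_1 - n p_1)^2 \, \frac{p_1 + p_2}{n p_1 p_2} \\
&= \left( \frac{\nu_1 - n p_1}{\sqrt{n p_1 (1 - p_1)}} \right)^2 .
\end{aligned}$$

By the central limit theorem the quantity inside the square converges in distribution to $N(0,1)$, and the square of a standard normal variable has the $\chi_1^2$ distribution, which matches $r - 1 = 1$ degree of freedom.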

[Math] Chi square goodness-of-fit test for Uniform distribution using Matlab

Setting parameters

N=100; % sample size
a=0; % lower boundary
b=1; % higher boundary

Sample N uniformly distributed values between a and b. Uncomment the second line to add some bias, making the sample non-uniform, if you want to test the code.

x=unifrnd(a,b,N,1);
%x(x<.9) = rand(sum(x<.9),1);

Using `chi2gof`

As described here, this call to chi2gof does not use the 'cdf of the hypothesized distribution' option; instead you need to specify the bin edges and the expected values.

nbins = 10; % number of bins
edges = linspace(a,b,nbins+1); % edges of the bins
E = N/nbins*ones(nbins,1); % expected value (equal for uniform dist)

[h,p,stats] = chi2gof(x,'Expected',E,'Edges',edges)

Using `chi2cdf`

With this function you need to supply the chi-squared test statistic $\chi^2$ yourself; it can be computed from the bin counts returned by the function histogram:

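h = histogram(x,edges); % also draws the histogram; h.Values holds the bin counts
chi = sum((h.Values - N/nbins).^2 / (N/nbins)); % Pearson's statistic
k = nbins-1; % degrees of freedom
p = 1 - chi2cdf(chi, k) % p-value: upper tail of the chi-squared distribution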

Note that if you don't pass the edges when counting the values per bin, histogram will choose its own bins spanning the lowest to the highest observed value, and the resulting statistic will therefore differ from the one returned by chi2gof.
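The two approaches can be compared side by side. The following is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is available; it swaps in `histcounts` (so no figure is drawn) and the `'upper'` option of `chi2cdf`, neither of which appears in the code above, and the variable names `O`, `chi2stat` and `pManual` are only illustrative.

```matlab
rng(0);                                   % fix the seed so the run is repeatable
N = 100; a = 0; b = 1; nbins = 10;
x = unifrnd(a, b, N, 1);                  % sample to be tested
edges = linspace(a, b, nbins+1);          % same bin edges for both approaches
E = N/nbins * ones(1, nbins);             % expected counts under H0 (row vector)

% Manual computation: observed counts, Pearson statistic, upper-tail p-value
O = histcounts(x, edges);                 % observed counts per bin (row vector)
chi2stat = sum((O - E).^2 ./ E);
df = nbins - 1;                           % no parameters were estimated
pManual = chi2cdf(chi2stat, df, 'upper'); % same as 1 - chi2cdf(chi2stat, df)

% Same test through chi2gof
[h, p, stats] = chi2gof(x, 'Expected', E, 'Edges', edges);

fprintf('manual : chi2 = %.4f, df = %d, p = %.4f\n', chi2stat, df, pManual);
fprintf('chi2gof: chi2 = %.4f, df = %d, p = %.4f\n', stats.chi2stat, stats.df, p);
```

With identical edges and expected counts, the two printed lines should agree. Note that `chi2gof` pools bins whose expected count falls below 5 by default, so the comparison may differ if `N/nbins` is made that small.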

Quick Theory recall

Just to recall how to interpret the final value, a few definitions:

The null hypothesis ($H_0$) is that the data x come from a uniform distribution.
Pearson's chi-squared test checks whether you can reject the null hypothesis, i.e. "Can I say that x does not come from a uniform distribution?"
Pearson's cumulative test statistic $\chi^2$ measures the discrepancy between the observed counts $O_i$ and the expected counts $E_i$ in each bin: $$\chi^2 = \sum_{i=1}^{n_{\text{bins}}} \frac{(O_i-E_i)^2}{E_i}$$ where the sum runs over the bins, not over the N samples.
The chi-squared distribution is the distribution that Pearson's cumulative test statistic $\chi^2$ approximately follows under $H_0$ (i.e. if the observations come from a uniform distribution).
The p-value is the probability of obtaining a result at least as extreme as the one observed when the null hypothesis is true. That is, the probability that, if we randomly draw a dataset y from a uniform distribution, its error (Pearson's cumulative test statistic) is equal to or higher than that of the actual observation x.
So we can reject $H_0$ if p is lower than a significance level $\alpha$. That is, for a small value of p, the data give strong evidence that x does not come from a uniform distribution.
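The interpretation of the p-value given above can be checked by simulation. The sketch below is only an illustration, assuming a uniform distribution on (0,1) and the same 10 equal-width bins; the helper `chi2of` and the number of simulated datasets `nsim` are hypothetical choices, not part of the original answer.

```matlab
rng(1);                                   % fix the seed so the run is repeatable
N = 100; nbins = 10; nsim = 5000;
edges = linspace(0, 1, nbins+1);
E = N/nbins;                              % expected count per bin under H0

% Pearson statistic of a sample v for the bins above (illustrative helper)
chi2of = @(v) sum((histcounts(v, edges) - E).^2 / E);

x = rand(N, 1);                           % the "observed" dataset
chiObs = chi2of(x);

% Draw many datasets y under H0 and record their statistics
chiSim = zeros(nsim, 1);
for s = 1:nsim
    chiSim(s) = chi2of(rand(N, 1));
end

pMonteCarlo = mean(chiSim >= chiObs);            % share of draws at least as extreme
pAsymptotic = chi2cdf(chiObs, nbins-1, 'upper'); % chi-squared approximation

fprintf('Monte Carlo p = %.4f, chi-squared p = %.4f\n', pMonteCarlo, pAsymptotic);
```

The two values should be close but not identical, since the chi-squared distribution is only the large-sample limit of the statistic and the Monte Carlo estimate carries its own sampling error.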
