[Math] Using Chi-Square to test normality.

normal distribution, probability distributions, statistics


This is a sample question we received. I can't figure out how to show statistically that this data is normally distributed. We are to use the chi-square method, and these are the steps we are to follow:

Use the hypothesized distribution to create expected values for each bin.
Use the observed and expected values to create a chi-squared value.
Compare this computed chi-squared value with the chi-squared value from a table, using the appropriate number of degrees of freedom for this problem and the significance level.
Based on the relationship between the computed chi-squared and the table's chi-squared values, accept or reject the hypothesis.

I have never used this method and was wondering if someone could show me how to work through a problem like this. Thanks!

Best Answer

The chi-squared test is pretty easy to use:

You are told that the engineer has determined that the mean is 10 mm and the standard deviation is 0.1 mm. This is all the information you need to specify a normal distribution.

The chi-squared test is based on turning your null hypothesis into a multinomial distribution, which is a generalization of the binomial distribution (where there are only 2 outcomes). Here, the possible outcomes are your "bins", and the probability $p_i$ assigned to each bin is the probability that a normally distributed random variable falls in that bin. As a rule of thumb, if you have N data points, you want to size your bins such that $Np_i \geq 5$.
Once you have your bins, you calculate the expected number of observations you would have in each bin $E_i$, assuming that your data actually come from a $Normal(10mm,(0.1mm)^2)$ distribution: $E_i=Np_i$.
Now you just need to count the actual number of observations in each bin ($O_i$).
For each bin, you want to form the statistic: $S_i=\frac{(O_i-E_i)^2}{E_i}$
Calculate the Chi-squared statistic $\chi^2 = \sum S_i$
Now here comes the theoretical part: when the data actually do come from the multinomial distribution you constructed (using the hypothesized normal distribution), the distribution of $\chi^2$ is asymptotically $\chi^2_{k-1}$, where $k$ is the number of bins. The $k-1$ applies here because the mean and standard deviation are given rather than estimated from the data.
Now just compare the computed $\chi^2$ to the $(1-\alpha)$ quantile of the $\chi^2_{k-1}$ distribution: you reject the hypothesis if $\chi^2$ exceeds that quantile.
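The steps above can be sketched in a few lines of Python (standard library only). This is an illustration under stated assumptions, not the solution to the original exercise: the `sample` values are made up because the data from the question is not reproduced here, the helper name `chi_square_normal_gof` is hypothetical, the bins are chosen to be equally probable under the hypothesized normal distribution (one reasonable choice among many), and 7.815 is the tabulated 0.95 quantile of $\chi^2_3$.

```python
from statistics import NormalDist

def chi_square_normal_gof(data, mu, sigma, k):
    """Pearson chi-squared statistic for H0: data ~ Normal(mu, sigma^2).

    Uses k bins that are equally probable under H0, so each E_i = N / k.
    Returns (statistic, observed_counts, expected_count_per_bin).
    """
    dist = NormalDist(mu, sigma)
    expected = len(data) / k
    observed = [0] * k
    for x in data:
        # Under H0 the CDF value is uniform on [0, 1]; scale it to a bin index.
        i = min(int(dist.cdf(x) * k), k - 1)
        observed[i] += 1
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, observed, expected

# Made-up measurements in mm (NOT the data from the question).
sample = [9.82, 9.88, 9.90, 9.92, 9.94, 9.95, 9.96, 9.97, 9.98, 9.99,
          10.00, 10.01, 10.02, 10.03, 10.04, 10.05, 10.07, 10.09, 10.11, 10.15]

stat, observed, expected = chi_square_normal_gof(sample, mu=10.0, sigma=0.1, k=4)

# Tabulated 0.95 quantile of chi-squared with k - 1 = 3 degrees of freedom.
CRITICAL_3DF_5PCT = 7.815
reject = stat > CRITICAL_3DF_5PCT

print(observed, expected, round(stat, 3), reject)
```

With $k=4$ equal-probability bins and $N=20$, every expected count is 5, so the rule of thumb $Np_i \geq 5$ is just met. For this made-up sample the statistic is far below the table value, so the hypothesis of normality would not be rejected at the 5% level.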

Related Solutions

[Math] Normalization for Chi square test

In the version of this test that I am familiar with, the individual data are categorical, not quantitative like your examples. The expected and observed values should be frequencies of some category (a count of how many times it occurs), not some individual's quantitative measurement. The numbers that go into the $E_i$ and $O_i$ positions are unitless, as they are just counts.

So for example, in a box with mixed fruit, maybe 12 pieces were bananas, but you were expecting 15 to be bananas. You will have the term $$\frac{(12-15)^2}{15}$$ and there is no way to rescale units as you did. Writing $$\frac{(12000-15000)^2}{15000}$$ would correspond to a very different scenario. There you would have seen 12000 bananas when you were expecting 15000. And the corresponding $P$ value should be a lot smaller, because it should be a lot less likely to be off by 3000 out of 15000 than 3 out of 15, when you consider the variance from one piece of fruit to the next on its chances to be a banana. So $\chi^2$ should be a lot larger in the latter case.
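Working the two terms out makes the difference concrete (this only evaluates the two expressions above; the full statistic would also include the terms for the other kinds of fruit):

$$\frac{(12-15)^2}{15} = \frac{9}{15} = 0.6, \qquad \frac{(12000-15000)^2}{15000} = \frac{9\,000\,000}{15000} = 600.$$

More generally, multiplying both counts by a factor $c$ multiplies the term by $c$, since $\frac{(cO_i-cE_i)^2}{cE_i} = c\,\frac{(O_i-E_i)^2}{E_i}$. So the statistic is not invariant under rescaling the counts.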

[Math] Chi square goodness-of-fit test for Uniform distribution using Matlab

Just to expand on it and give all the code.

Setting parameters

N=100; % sample size
a=0; % lower boundary
b=1; % higher boundary

Sample N uniformly distributed values between a and b. Uncomment the second line to add some bias and make the sample non-uniform if you want to test the code.

x=unifrnd(a,b,N,1);
%x(x<.9) = rand(sum(x<.9),1);

Using `chi2gof`

As described here, with chi2gof you can't use the 'cdf of the hypothesized distribution' and need to specify the bins, the edges and the expected values.

nbins = 10; % number of bin
edges = linspace(a,b,nbins+1); % edges of the bins
E = N/nbins*ones(nbins,1); % expected value (equal for uniform dist)

[h,p,stats] = chi2gof(x,'Expected',E,'Edges',edges)

Using `chi2cdf`

With this function you need to supply the chi-squared test statistic $\chi^2$, which can be computed with the function histogram:

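hgram = histogram(x,edges); % observed counts per bin (also draws the plot)
chi = sum((hgram.Values - N/nbins).^2 / (N/nbins)); % Pearson statistic
k = nbins-1; % degrees of freedom
p = chi2cdf(chi, k, 'upper') % p-value = upper tail, i.e. 1 - chi2cdf(chi, k)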

Note that if you don't pass the edges when counting the values per bin, histogram will choose bins running from the lowest to the highest observed value, and the final statistic will therefore differ from the one given by chi2gof.
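The effect of the bin edges can be checked by hand on a toy data set. The sketch below is in Python (standard library only) purely so it runs without any toolbox. It does not reproduce MATLAB's automatic binning; it simply takes the description above at face value and compares edges fixed at [a, b] with edges spanning the smallest to the largest observation. The data values and the helper name `pearson_uniform` are made up for illustration, and with only 16 points the expected count per bin is 4, below the usual rule of thumb of 5.

```python
def pearson_uniform(data, lo, hi, nbins):
    """Pearson statistic against equal expected counts on nbins bins over [lo, hi]."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for x in data:
        # The last bin is closed on the right, so x == hi lands in it.
        i = min(int((x - lo) / width), nbins - 1)
        counts[i] += 1
    expected = len(data) / nbins
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, counts

# Made-up sample on [0, 1].
data = [0.05, 0.12, 0.26, 0.31, 0.33, 0.41, 0.47, 0.52,
        0.58, 0.66, 0.71, 0.79, 0.83, 0.88, 0.90, 0.95]

# Edges fixed by the hypothesized distribution: a = 0, b = 1.
fixed_stat, fixed_counts = pearson_uniform(data, 0.0, 1.0, 4)

# Edges taken from the data: smallest to largest observation.
auto_stat, auto_counts = pearson_uniform(data, min(data), max(data), 4)

print(fixed_counts, fixed_stat)
print(auto_counts, auto_stat)
```

The value 0.26 falls in the second bin with the fixed edges but in the first bin with the data-driven edges, which is enough to change the statistic.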

Quick Theory recall

To recall how to interpret the final value, here are a few definitions:

The null hypothesis ($H_0$) is that the data x come from a uniform distribution.
Pearson's chi-squared test checks whether you can reject the null hypothesis, i.e. "Can I say that x does not come from a uniform distribution?"
Pearson's cumulative test statistic $\chi^2$ measures the discrepancy between the observed and expected counts, summed over the bins: $$\chi^2 = \sum_{i=1}^{n} \frac{(O_i-E_i)^2}{E_i}$$ where $n$ is the number of bins (nbins in the code above).
The chi-squared distribution is the distribution that Pearson's cumulative test statistic $\chi^2$ follows (asymptotically) under $H_0$, i.e. if the observations come from a uniform distribution.
The p-value is the probability of obtaining a result at least as extreme as the one observed when the null hypothesis is true. That is, the probability that, if we randomly draw a dataset y from a uniform distribution, its error (Pearson's cumulative test statistic) is equal to or higher than that of the actual observation x.
So we can reject $H_0$ if p is lower than a significance level $\alpha$. That is, for a small value of p, we have evidence that x does not come from a uniform distribution.
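To make the p-value concrete, here is a small sketch in Python (standard library only) of the upper-tail probability of a chi-squared distribution with integer degrees of freedom. The function name `chi2_sf` and the example statistic 20.0 are made up for illustration. The code relies on the recurrence $Q(a+1,z) = Q(a,z) + z^a e^{-z}/\Gamma(a+1)$ for the regularized upper incomplete gamma function, starting from $Q(1,z)=e^{-z}$ for even degrees of freedom and $Q(1/2,z)=\operatorname{erfc}(\sqrt{z})$ for odd ones.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for X ~ chi-squared with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # Q(1, x/2)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # Q(1/2, x/2)
    # Step a up by 1 until it reaches df / 2.
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical statistic from a test with 10 bins, i.e. 9 degrees of freedom.
p_value = chi2_sf(20.0, 9)
alpha = 0.05
reject = p_value < alpha
print(p_value, reject)
```

As a sanity check, plugging in the tabulated 5% critical values (3.841 for 1, 5.991 for 2 and 16.919 for 9 degrees of freedom) should return a value close to 0.05 in each case.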