Solved – How Do You Choose The Number of Bins To Use For A Chi-Squared GOF Test

application, binning, chi-squared-test

I'm working on developing a physics lab about radioactive decay, and in analyzing sample data I've taken, I ran into a statistics issue that surprised me.

It is well known that the number of decays per unit time by a radioactive source is Poisson distributed. The way the lab works is that students count the number of decays per time window, and then repeat this many many times. Then they bin their data by the number of counts, and do a $\chi^2$ goodness of fit test with 1 parameter estimated (the mean) to check whether or not the null hypothesis (the data is drawn from a Poisson distribution with the estimated mean value) holds. Hopefully they'll get a large p-value and conclude that physics indeed works (yay).
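The procedure described above can be sketched in a few lines (a minimal illustration, assuming Python with NumPy/SciPy; the mean of 80 counts/min and the sample size of 500 windows are made-up numbers, not the actual lab data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: decay counts in 500 one-minute windows, simulated
# here from a Poisson distribution (assumed mean of 80 counts/min)
counts = rng.poisson(lam=80, size=500)

lam_hat = counts.mean()  # the one estimated parameter (the mean)

# Bin by integer count value: one bin per observed value
values = np.arange(counts.min(), counts.max() + 1)
observed = np.array([(counts == v).sum() for v in values])

# Expected frequencies under Poisson(lam_hat), renormalized so the
# totals match exactly (scipy.stats.chisquare requires equal sums)
expected = stats.poisson.pmf(values, lam_hat) * counts.size
expected *= observed.sum() / expected.sum()

# ddof=1 removes one extra degree of freedom for the estimated mean,
# so df = (number of bins) - 1 - 1
chi2, p = stats.chisquare(observed, expected, ddof=1)
print(chi2, p)
```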

I noticed that the way I binned my data had a large effect on the p-value. For example, if I chose lots of very small bins (e.g. a separate bin for each integer: 78 counts/min, 79 counts/min, etc.) I got a small p-value, and would have had to reject the null hypothesis. If, however, I binned my data into fewer bins (e.g. using the number of bins given by Sturges' rule: $1+\log_2 N$), I got a much larger p-value, and did NOT reject the null hypothesis.

Looking at my data, it looks extremely Poisson-distributed (it lines up almost perfectly with the expected counts per minute). That said, there are a few counts in bins very far away from the mean. That means when computing the $\chi^2$ statistic using very small bins, I have a few terms like:
$$\frac{(\text{Observed}-\text{Expected})^2}{\text{Expected}} = \frac{(1-0.05)^2}{0.05}=18.05$$
This leads to a high $\chi^2$ statistic, and thus a low p-value. As expected, the problem goes away for larger bin widths, since the expected value never gets that low.
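The effect can be reproduced with a small simulation (a sketch, assuming Python; the sample size of 1000 and the mean of 80 are illustrative), comparing one-bin-per-integer binning against bins that group several consecutive integers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
counts = rng.poisson(lam=80, size=1000)  # simulated data (assumed mean 80)
lam_hat = counts.mean()

def gof_pvalue(group_width):
    """Chi-squared GOF p-value, grouping `group_width` consecutive
    integer count values into each bin."""
    # Half-integer edges so every integer falls strictly inside a bin
    edges = np.arange(counts.min(), counts.max() + group_width + 1,
                      group_width) - 0.5
    observed, _ = np.histogram(counts, bins=edges)
    # Probability mass of each bin under the fitted Poisson,
    # renormalized so observed and expected totals agree
    probs = np.diff(stats.poisson.cdf(edges, lam_hat))
    expected = probs / probs.sum() * observed.sum()
    return stats.chisquare(observed, expected, ddof=1).pvalue

p_fine = gof_pvalue(1)    # one bin per integer: tiny expectations in the tails
p_coarse = gof_pvalue(8)  # wider bins: no bin with a near-zero expectation
print(p_fine, p_coarse)
```

With per-integer bins, the tail bins have expected counts far below 1, so a single stray observation contributes a huge term to the statistic, exactly as in the fraction above.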

Questions:

Is there a good rule of thumb for choosing bin sizes when doing a $\chi^2$ GOF test?

Is this discrepancy between outcomes for different bin sizes something that I should have known about*, or is it indicative of some larger problem in my proposed data analysis?


Thank you

*(I took a stats class in undergrad, but it's not my area of expertise.)

Best Answer

Is this discrepancy between outcomes for different bin sizes something that I should have known about*, or is it indicative of some larger problem in my proposed data analysis?

The binning of the radioactive decay sample set is a red herring here. The real problem originates from the fact that the chi-squared test (alongside other hypothesis-testing frameworks) is highly sensitive to sample size. In the case of chi-squared, as sample size increases, absolute differences become an increasingly smaller portion of the expected value. As such, if the sample size is very large, we may find small p-values and statistical significance even when the deviations are small and practically uninteresting. Conversely, a reasonably strong association may not come up as significant if the sample size is small.
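This sensitivity is easy to demonstrate with a toy example (a sketch, assuming Python; the three-category proportions are made up): hold a tiny, fixed deviation from the null and grow the sample size.

```python
import numpy as np
from scipy import stats

# A fixed, tiny distortion of a fair three-way split:
# proportions (0.34, 0.33, 0.33) instead of (1/3, 1/3, 1/3)
observed_props = np.array([0.34, 0.33, 0.33])
null_props = np.array([1.0, 1.0, 1.0]) / 3

pvals = []
for n in (100, 10_000, 1_000_000):
    observed = observed_props * n   # idealized observed counts
    expected = null_props * n
    pvals.append(stats.chisquare(observed, expected).pvalue)
    print(n, pvals[-1])

# The deviation never changes, yet the p-value collapses as n grows
```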

Is there a good rule of thumb for choosing bin sizes when doing a χ2 GOF test?

The answer seems to be that one should not aim to find the right N (I am not sure it is doable, but it would be great if someone else chipped in to contradict me), but rather to look beyond p-values alone when N is high. This seems to be a good paper on the subject: Too Big to Fail: Large Samples and the p-Value Problem

P.S. There are alternatives to the χ2 test, such as the G-test, and effect-size measures such as Cramér's V; however, with any such test you will still hit the same issue of large N -> small p-value.