Solved – Chi squared test on binned data: how many degrees of freedom

chi-squared-test, degrees of freedom, poisson distribution, poisson-regression

I have 1000 binned data points to which I would like to fit a Poisson distribution. I define $N_{i}$ to be the number of times I measure $i$ counts in a fixed time period. I calculate the mean $\hat{\mu} = \frac{1}{N_{\mathrm{tot}}} \sum_{i} i N_{i}$, and use this to define my fitted distribution.
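As a minimal sketch of this step, the following computes $\hat{\mu}$ and the fitted frequencies from a frequency table. The counts are invented for illustration (they are not the actual measurements), and the names `counts` and `poisson_pmf` are placeholders.

```python
import math

# Hypothetical frequency table: counts[i] = N_i, the number of periods with i counts.
# These numbers are invented for illustration; they are not the real measurements.
counts = [300, 360, 215, 88, 27, 8, 2]

def poisson_pmf(i, mu):
    """P(X = i) for X ~ Poisson(mu), computed on the log scale for stability."""
    return math.exp(i * math.log(mu) - mu - math.lgamma(i + 1))

n_tot = sum(counts)                                        # N_tot
mu_hat = sum(i * n for i, n in enumerate(counts)) / n_tot  # sample mean
expected = [n_tot * poisson_pmf(i, mu_hat) for i in range(len(counts))]

print(f"N_tot = {n_tot}, mu_hat = {mu_hat:.3f}")
for i, (n, e) in enumerate(zip(counts, expected)):
    print(f"i = {i}: observed {n:4d}, fitted {e:8.2f}")
```

The fitted frequencies over the occupied bins sum to slightly less than $N_{\mathrm{tot}}$, because the fitted distribution also puts a little probability on bins beyond the last occupied one.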

I would now like to perform a $\chi^{2}$ test to test the goodness of fit. I calculate

$\chi^{2} = \sum^{\infty}_{i = 0} \frac{(N_{i} - \langle N_{i} \rangle)^{2}}{\langle N_{i} \rangle}$.

Here $i$ is the observed number of counts in some time period, $N_{i}$ is the observed frequency, and $\langle N_{i} \rangle$ is the frequency predicted by my distribution. I appreciate that this does not account for the non-Gaussian error on the counts in each bin.

There is only data in the first 7 bins, i.e. $i = 0$ to $6$. However, it is implied that all bins $i \ge 7$ contain zero counts. I therefore get

$\chi^{2} = \sum^{6}_{i = 0} \frac{(N_{i} - \langle N_{i} \rangle)^{2}}{\langle N_{i} \rangle} + \sum^{\infty}_{i = 7} \langle N_{i} \rangle$,

which I can calculate. The last term makes only a small contribution.
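The split into a finite sum plus a tail can be sketched as follows. The sketch assumes the same kind of frequency table as in the question and uses invented counts; `chi2_with_empty_tail` is a hypothetical helper name. It relies on the fact that the fitted frequencies over all bins sum to $N_{\mathrm{tot}}$, so the infinite tail sum never has to be evaluated term by term.

```python
import math

def poisson_pmf(i, mu):
    """P(X = i) for X ~ Poisson(mu)."""
    return math.exp(i * math.log(mu) - mu - math.lgamma(i + 1))

def chi2_with_empty_tail(counts):
    """Return (head, tail): the Pearson sum over the listed bins and the
    contribution of the infinitely many empty bins beyond them."""
    n_tot = sum(counts)
    mu_hat = sum(i * n for i, n in enumerate(counts)) / n_tot
    expected = [n_tot * poisson_pmf(i, mu_hat) for i in range(len(counts))]
    head = sum((n - e) ** 2 / e for n, e in zip(counts, expected))
    # Each empty bin contributes (0 - E_i)^2 / E_i = E_i. The E_i over all bins
    # sum to N_tot, so the tail equals N_tot minus the fitted head.
    tail = n_tot - sum(expected)
    return head, tail

# Hypothetical counts for i = 0..6 (invented for illustration).
counts = [300, 360, 215, 88, 27, 8, 2]
head, tail = chi2_with_empty_tail(counts)
print(f"head = {head:.4f}, tail = {tail:.4f}, chi2 = {head + tail:.4f}")
```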

My question is therefore this: when I go to calculate my confidence levels and things from this chi-squared value, how many degrees of freedom do I have? I have 7 bins with data in, but there are also infinitely many empty bins! Should these be counted? Indeed, is $\chi^{2}$ even the correct test for this data?

Best Answer

Your chi-square approximation will tend to be poor if you don't have substantial expected counts in almost all cells. With unequal expectations, you would typically want most cells to have an expected count a good deal larger than 1, and many people use a rule that says you should have at least 5 in all cells, though that's often too stringent.

Unfortunately, you don't have much choice other than to combine cells based on the observed data (since you don't have a priori expected proportions). However, I think the impact of doing so should be small (one can always simulate to assess its effect in problems similar to yours).
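Such a simulation might look like the sketch below. It draws Poisson samples, estimates the mean from the raw data, pools the upper tail with a data-based rule, and reports how often a nominal 5% test rejects when the Poisson model is true. Everything specific here is an assumption made for illustration: the pooling rule (keep the last cell's fitted count at or above `min_expected`), `mu = 1.2`, the sample size, the use of `cells - 2` df, and the helper names. It uses only the standard library; in practice one would usually take the chi-square tail probability from a statistics package.

```python
import math
import random
from collections import Counter

def poisson_pmf(i, mu):
    return math.exp(i * math.log(mu) - mu - math.lgamma(i + 1))

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square with positive integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        term = total = math.exp(-h)
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return total
    total = math.erfc(math.sqrt(h))
    term = math.exp(-h) * math.sqrt(h) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (j + 0.5)
    return total

def pooled_test(sample, min_expected=5.0):
    """Return (statistic, df) after pooling the upper tail into one '>= m' cell."""
    n = len(sample)
    mu_hat = sum(sample) / n  # estimated from the raw, unbinned data
    obs = Counter(sample)
    # Assumed data-based rule: the largest m whose '>= m' cell still has a
    # fitted count of at least min_expected.
    m, below = 0, 0.0  # below = fitted P(X < m)
    while n * (1.0 - below - poisson_pmf(m, mu_hat)) >= min_expected:
        below += poisson_pmf(m, mu_hat)
        m += 1
    expected = [n * poisson_pmf(i, mu_hat) for i in range(m)]
    expected.append(n * (1.0 - below))
    observed = [obs[i] for i in range(m)]
    observed.append(n - sum(observed))
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(expected) - 2  # cells - 1 - one estimated parameter

def rejection_rate(mu=1.2, n=1000, reps=500, alpha=0.05, seed=1):
    rng = random.Random(seed)
    support = range(40)  # truncated support; P(X >= 40) is negligible here
    weights = [poisson_pmf(i, mu) for i in support]
    rejections = 0
    for _ in range(reps):
        sample = rng.choices(support, weights=weights, k=n)
        stat, df = pooled_test(sample)
        rejections += chi2_sf(stat, df) < alpha
    return rejections / reps

print(f"rejection rate at nominal 5%: {rejection_rate():.3f}")
```

A rejection rate close to 0.05 would suggest that the data-based pooling does little harm for problems of this size; a rate well away from it would suggest calibrating the test by simulation instead.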

Also see: Impact of data-based bin boundaries on a chi-square goodness of fit test?

If you only have data out to 6, your parameter will be somewhere in the vicinity of 1 to 1.5 or thereabouts.

At the least, I'd combine all the cells from $i = 6$ upward into a single $i \ge 6$ cell, which leaves 7 cells.

You lose 1 df from those 7 because the total is fixed, leaving 6, and you lose slightly less than another df estimating the mean (you're using more information than was in the table, since you're working with the complete data rather than the re-binned data; in this case that should make almost no difference).

So it looks to me like you'd be left with 5 df (and a bit), though the statistic won't really have a chi-square distribution in any case, except fairly approximately. Note that the chi-square test ignores the ordering of the cells, so it has relatively low power at picking up (say) a distribution with higher dispersion than the Poisson.
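One hedged way to act on "5 df and a bit" is to compute the p-value at both 5 and 6 df and treat the two as a bracket. This assumes the 7-cell layout with a pooled $i \ge 6$ cell and a mean estimated from the unbinned data. The counts below are invented for illustration, and `chi2_sf` is a small stand-in for a library chi-square tail function.

```python
import math

def poisson_pmf(i, mu):
    return math.exp(i * math.log(mu) - mu - math.lgamma(i + 1))

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square with positive integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        term = total = math.exp(-h)
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return total
    total = math.erfc(math.sqrt(h))
    term = math.exp(-h) * math.sqrt(h) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (j + 0.5)
    return total

# Hypothetical counts for cells 0, 1, ..., 5 and '>= 6' (invented for illustration).
# No observation exceeds 6, so the raw-data mean can be read off the table.
counts = [300, 360, 215, 88, 27, 8, 2]
n_tot = sum(counts)
mu_hat = sum(i * n for i, n in enumerate(counts)) / n_tot

# Fitted counts: individual cells 0..5, then everything else pooled into '>= 6'.
expected = [n_tot * poisson_pmf(i, mu_hat) for i in range(6)]
expected.append(n_tot - sum(expected))

stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
p_low = chi2_sf(stat, 5)   # 7 cells - 1 - a full df for the estimated mean
p_high = chi2_sf(stat, 6)  # 7 cells - 1, no df charged for the estimate
print(f"chi2 = {stat:.3f}, p between {p_low:.3f} and {p_high:.3f}")
```

If the two p-values fall on the same side of the chosen significance level, the fractional df does not change the conclusion; if they straddle it, simulating the statistic under the fitted model is the safer way to decide.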