Solved – Comparing two discrete distributions (with small cell counts)

Tags: chi-squared-test, distributions, hypothesis-testing

I need to compare sample distributions with theoretical ones, which is typically done with a chi-squared test. The problem is that I have distributions where one or more cells have low values, and consequently the chi-squared test reports very small p-values.
For example, typical expected and observed frequencies are [152 2 9] and [140 5 18], with a p-value of 0.0007. Based on domain knowledge, these two distributions are not significantly different.
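For reference, the standard chi-squared test on these counts can be reproduced with SciPy (a sketch; the counts are the ones quoted above, and `scipy.stats.chisquare` is one common implementation of the Pearson test):

```python
from scipy.stats import chisquare

observed = [140, 5, 18]   # observed frequencies
expected = [152, 2, 9]    # expected frequencies (same total, 163)

# Pearson chi-squared statistic: sum((O - E)^2 / E), df = cells - 1
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(stat, p)  # statistic ≈ 14.447, p ≈ 0.0007
```

The small expected count of 2 in the second cell is what makes the asymptotic chi-squared approximation questionable here.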

What test could be used instead of chi-squared that would avoid the bias arising from cells with small expected counts?

Edit: adding some background information for this problem.

I have a number of processes which produce certain technical parameters as output, recorded as time series. I have around 4,000 such processes, each producing around 150 such time series (the number of time series per process follows a power law). I would like to find which of these processes are anomalous, i.e. which produce output significantly different from the others.
To do this I cluster the time series using k-means and then, based on the clusters, produce the "expected" distribution (the average over all time series) and the distribution of clusters for each process.

For example, after clustering I might have 4 clusters with the following sizes.

Cluster number | Cluster size
-----------------------------
1              | 100
2              | 200
3              | 300
4              | 400

The distribution of the clusters among the processes might be the following:

          | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
----------------------------------------------------------
Process 1 | 11        | 19        | 35        | 42
Process 2 | 3         | 10        | 14        | 19
Process 3 | 30        | 8         | 12        | 12              <----anomaly
....

In this case, processes 1 and 2 are sufficiently close to the expected distribution, while process 3 differs from the average. I would like to find a good test to measure this discrepancy. (Any other suggestion for anomaly detection is also welcome.)
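One way to quantify the per-process discrepancy is a chi-squared statistic, sketched below under the assumption that each process's expected counts are the overall cluster proportions scaled to that process's total (the numbers are the ones from the tables above):

```python
import numpy as np

cluster_sizes = np.array([100, 200, 300, 400])
proportions = cluster_sizes / cluster_sizes.sum()  # overall mix: 0.1, 0.2, 0.3, 0.4

processes = {
    "Process 1": np.array([11, 19, 35, 42]),
    "Process 2": np.array([3, 10, 14, 19]),
    "Process 3": np.array([30, 8, 12, 12]),  # the suspected anomaly
}

def chisq_discrepancy(observed, proportions):
    # Expected counts: overall proportions scaled to this process's total.
    expected = proportions * observed.sum()
    return ((observed - expected) ** 2 / expected).sum()

for name, obs in processes.items():
    print(name, round(chisq_discrepancy(obs, proportions), 2))
# Process 3's statistic is far larger than those of the other two processes.
```

The statistic alone already ranks the processes by how far each sits from the expected mix; converting it to a p-value is the subject of the answer below.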

Best Answer

There are two technical issues to deal with: (1) measuring the discrepancy between observed and expected and (2) computing the p-value.

We can retain the chi-squared measure of discrepancy (thereby finessing issue 1) and compute an exact p-value. The simplest way is to simulate sampling from the expected distribution. Here is the distribution of the chi-squared statistic over 10,000 such samples, computed in R:
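The answer's simulation was done in R; a minimal Python sketch of the same idea follows, assuming the simulated samples are multinomial draws from the expected cell proportions:

```python
import numpy as np

expected = np.array([152, 2, 9])
observed = np.array([140, 5, 18])
n = observed.sum()                 # 163 draws per simulated sample
p = expected / expected.sum()      # expected cell probabilities

def chisq(obs, exp):
    return ((obs - exp) ** 2 / exp).sum()

stat_obs = chisq(observed, expected)   # 549/38 ≈ 14.447

rng = np.random.default_rng(1)
sims = rng.multinomial(n, p, size=10_000)              # 10,000 samples from the expected distribution
sim_stats = ((sims - expected) ** 2 / expected).sum(axis=1)

# Exact (simulation-based) p-value: the fraction of simulated statistics
# at least as extreme as the observed one.
p_value = (sim_stats >= stat_obs).mean()
print(p_value)
```

With this setup the estimated p-value lands near the 0.0025 found in the answer's run, a few times larger than the asymptotic 0.0007 but still small.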

[Histogram of the 10,000 simulated chi-squared statistics]

The actual chi-squared statistic for these data is $549/38 \approx 14.447$. Apparently it is far out in the upper tail of this histogram: only $25$ of the $10,000$ results (0.25%) equal or exceed it. Yes, this proportion is almost four times greater than the approximation of $0.0007$ reported by the chi-squared test, but it's still tiny. We conclude that the observed distribution is significantly different from the expected distribution.

The "domain knowledge" may indeed correctly suggest the amount of difference is not material. That, however, is independent of the finding that the observed frequencies are unlikely to arise randomly from a distribution with the expected frequencies. That is all that statistical significance means.