Solved – Comparing Categorical distributions with very different frequencies

categorical-data, chi-squared-test, distributions

I want to compare distributions of reads of a particular length across two different genes and test whether they are statistically different from each other.

For example, suppose we have the read counts for each category as a list:

gene1 = [140, 280, 122, 544, 681, 1461, 457, 10660, 1133, 770, 
         5903, 716, 9209, 2828, 2192, 4647, 1027, 1129, 2407, 
         292, 852, 851, 136, 392]

gene2 = [ 5, 4, 20, 8, 701, 221, 233, 480, 305, 1062, 424, 1023, 
         474, 1071, 363, 279, 319, 64, 1240, 55, 79, 163, 6, 24]

The list index is the category (read length), and each index corresponds to the same category in the other list. As you can see, the read count (the frequency in each category) can differ greatly between the two genes.

If I apply a chi-square test of independence, it results in a very high chi-square statistic because of the huge differences in frequencies between corresponding categories. There are also categories where the frequency is less than 5. Hence I seriously doubt the validity of applying this test.
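For completeness, this is roughly what I ran (a sketch using `scipy.stats.chi2_contingency`; the exact call is a reconstruction):

```python
import numpy as np
from scipy.stats import chi2_contingency

gene1 = [140, 280, 122, 544, 681, 1461, 457, 10660, 1133, 770,
         5903, 716, 9209, 2828, 2192, 4647, 1027, 1129, 2407,
         292, 852, 851, 136, 392]
gene2 = [5, 4, 20, 8, 701, 221, 233, 480, 305, 1062, 424, 1023,
         474, 1071, 363, 279, 319, 64, 1240, 55, 79, 163, 6, 24]

# 2 x 24 contingency table: rows are genes, columns are read-length categories
table = np.array([gene1, gene2])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)  # chi2 is huge and p is essentially 0
```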

I wanted to know whether there is a better way to compare these distributions.

I can calculate proportions within each category (to normalize both genes to the same scale), but I am not sure how to compare those proportions between the two distributions.

Edit: Why is it problematic?

The distributions of the two sets are very similar, but the magnitudes of the counts are very different: a negligible difference in probability density corresponds to a very large difference in actual counts. The chi-square statistic is therefore very high, giving a significant p-value where I would expect an insignificant one.

[Figure: plot of the two read-length distributions (hosted on imgur)]

Best Answer

Pearson's chi-square test works best for large sample sizes, where the limiting distribution of the test statistic is chi-square. In your case, with only 24 observations (and some cells below 5), there are better alternatives you can explore.

The Kolmogorov-Smirnov test is one of the most useful nonparametric tests for comparing two sample distributions. K-S tests are particularly powerful for continuous data, where the K-S statistic (the maximum distance between the two empirical CDFs) converges asymptotically to the Kolmogorov distribution, provided F is continuous. Here, however, both your limited sample size and the discreteness of your data (which produces many ties) may invalidate the conditions necessary for the K-S test.
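To illustrate the tie problem, here is a sketch (assuming scipy is available) that expands the counts back into one observation per read and runs `scipy.stats.ks_2samp`:

```python
import numpy as np
from scipy.stats import ks_2samp

gene1 = np.array([140, 280, 122, 544, 681, 1461, 457, 10660, 1133, 770,
                  5903, 716, 9209, 2828, 2192, 4647, 1027, 1129, 2407,
                  292, 852, 851, 136, 392])
gene2 = np.array([5, 4, 20, 8, 701, 221, 233, 480, 305, 1062, 424, 1023,
                  474, 1071, 363, 279, 319, 64, 1240, 55, 79, 163, 6, 24])

# One observation per read; the value is the category index (read length).
# Because there are only 24 distinct values, the samples have massive ties.
sample1 = np.repeat(np.arange(len(gene1)), gene1)
sample2 = np.repeat(np.arange(len(gene2)), gene2)

stat, pvalue = ks_2samp(sample1, sample2)
# stat is the max distance between the two empirical CDFs; pvalue should be
# treated with caution, since the asymptotic theory assumes a continuous F
```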

Permutation tests are especially well suited to overcoming the obstacles of both the K-S test and Pearson's test. Permutation (or randomization) tests have much the flavour of the bootstrap in that they resample the data; unlike the bootstrap, however, they use permutations of the pooled data rather than resampling with replacement. The added flexibility is that you can measure the deviation between the two distributions with any statistic you like (e.g. the K-S statistic itself, a difference in medians, a ratio of means, etc.). This is fairly simple to implement in any programming framework.
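A minimal sketch of such a permutation test in Python (the choice of the K-S distance between category proportions as the statistic is mine; any other statistic can be substituted):

```python
import numpy as np

rng = np.random.default_rng(0)

gene1 = np.array([140, 280, 122, 544, 681, 1461, 457, 10660, 1133, 770,
                  5903, 716, 9209, 2828, 2192, 4647, 1027, 1129, 2407,
                  292, 852, 851, 136, 392])
gene2 = np.array([5, 4, 20, 8, 701, 221, 233, 480, 305, 1062, 424, 1023,
                  474, 1071, 363, 279, 319, 64, 1240, 55, 79, 163, 6, 24])

def ks_distance(c1, c2):
    """Max absolute difference between the empirical CDFs of the proportions."""
    return np.max(np.abs(np.cumsum(c1 / c1.sum()) - np.cumsum(c2 / c2.sum())))

observed = ks_distance(gene1, gene2)

# Pool all reads: one entry per read, labelled by its category index
pooled = np.repeat(np.arange(len(gene1)), gene1 + gene2)
n1 = gene1.sum()

n_perm = 1000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    rng.shuffle(pooled)  # randomly reassign reads to the two genes
    c1 = np.bincount(pooled[:n1], minlength=len(gene1))
    c2 = np.bincount(pooled[n1:], minlength=len(gene1))
    perm_stats[i] = ks_distance(c1, c2)

# Add-one correction keeps the estimated p-value away from exactly zero
p_value = (1 + np.sum(perm_stats >= observed)) / (1 + n_perm)
```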

Alternatively, you can run a two-sample bootstrap test using a distance metric between the two distributions and test your hypothesis that way.
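A sketch of that variant, again using the K-S distance as the (arbitrary) metric; the essential change from a permutation test is that the pooled reads are resampled with replacement:

```python
import numpy as np

rng = np.random.default_rng(1)

gene1 = np.array([140, 280, 122, 544, 681, 1461, 457, 10660, 1133, 770,
                  5903, 716, 9209, 2828, 2192, 4647, 1027, 1129, 2407,
                  292, 852, 851, 136, 392])
gene2 = np.array([5, 4, 20, 8, 701, 221, 233, 480, 305, 1062, 424, 1023,
                  474, 1071, 363, 279, 319, 64, 1240, 55, 79, 163, 6, 24])

def ks_distance(c1, c2):
    """Max absolute difference between the empirical CDFs of the proportions."""
    return np.max(np.abs(np.cumsum(c1 / c1.sum()) - np.cumsum(c2 / c2.sum())))

observed = ks_distance(gene1, gene2)

# Pool all reads, then draw both bootstrap groups from the pool (null model)
pooled = np.repeat(np.arange(len(gene1)), gene1 + gene2)
n1, n2 = int(gene1.sum()), int(gene2.sum())

n_boot = 1000
boot_stats = np.empty(n_boot)
for i in range(n_boot):
    c1 = np.bincount(rng.choice(pooled, n1), minlength=len(gene1))
    c2 = np.bincount(rng.choice(pooled, n2), minlength=len(gene1))
    boot_stats[i] = ks_distance(c1, c2)

p_value = (1 + np.sum(boot_stats >= observed)) / (1 + n_boot)
```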

Ultimately, it will boil down to the methodology you are comfortable with and your assessment of the applicability of underlying assumptions associated with each methodology.
