Solved – approach for comparing two groups of frequency distributions

distributions, frequency, kolmogorov-smirnov test, statistical significance

I need to test whether the frequency distribution of a variable differs between two groups of subjects. Each subject is characterized by a list of values, from which a frequency distribution can be constructed. To be concrete, each subject is an animal, and I measure several hundred cells per animal; there is a value associated with each cell. I then create a frequency distribution for each subject/animal (all possible values of the variable are divided into bins, and the fraction of cells within each bin is calculated). I need to determine whether there are any differences in the shape of the frequency distribution between the two groups of animals.
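The binning step described above can be sketched as follows. This is only an illustration: the function name `frequency_distribution`, the bin range, and the cell values are made up, and the sketch assumes equal-width bins with every value lying inside `[lo, hi]`.

```python
def frequency_distribution(values, lo, hi, n_bins):
    """Fraction of values in each of n_bins equal-width bins on [lo, hi].

    Assumes every value satisfies lo <= value <= hi.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that a value equal to hi lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


# Hypothetical cell values for a single animal (made-up numbers).
cells = [0.5, 1.2, 1.8, 2.5, 3.1, 3.9, 0.1, 2.2]
fractions = frequency_distribution(cells, lo=0.0, hi=4.0, n_bins=4)
print(fractions)
```

Because each entry is a fraction of that animal's cells, the entries sum to 1 for every animal, regardless of how many cells were measured.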

If I had to compare two animals with each other (simply two frequency distributions), the Kolmogorov-Smirnov test would seem appropriate. But the problem is that I need to compare two groups of animals. In other words, there is some variability among individual subjects within one group, and there is uncertainty regarding the true average frequency distribution of each population of subjects, which I feel I need to capture.
What would be the proper statistical test/approach to do this?
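For reference, the two-animal comparison mentioned above rests on the two-sample Kolmogorov-Smirnov statistic, the largest vertical gap between the two empirical cumulative distribution functions. A minimal sketch, with a hypothetical function name and made-up values, and computing only the statistic (no p-value):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    d = 0.0
    # Once one sample is exhausted its ECDF is 1 and the gap can only shrink,
    # so the loop can stop there.
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every observation equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical per-cell values for two animals (made-up numbers).
animal_1 = [1, 2, 3, 4]
animal_2 = [3, 4, 5, 6]
d_stat = ks_statistic(animal_1, animal_2)
print(d_stat)
```

This treats the cells of each animal as one sample, so on its own it says nothing about the animal-to-animal variability within a group, which is exactly the gap this question is about.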

To illustrate the data, here are some sample graphs: individual frequency distributions for 4 subjects from each group, and mean frequency distributions with error bars (s.e.m.). Once again, the research question is: are there any statistically significant differences in the distributions of values between groups? Less formally, does the red group have more cells with higher values, i.e. is there a shift to the right in the values? (Because these are percentages of values, the total area under each curve is 100%.)

Individual frequency distributions for 4 subjects from both groups

Mean frequency distributions

Best Answer

OK, let me see if I understood your question correctly. So you have two groups $X = (x_{1}, x_{2}, \ldots, x_{n})$ and $Y = (y_{1}, y_{2}, \ldots, y_{n})$, and now you want to determine (or maybe visualize) differences between specific feature pairs $(x_{i}, y_{i})$?

That looks to me to be easily possible by applying a simple distance function to $X$ and $Y$, for instance the Manhattan distance: $$M(X, Y) = \sum^{n}_{i = 1} |x_{i} - y_{i}|$$ The resulting distance tells you in general how (dis)similar the two groups are. If you now want to find specific elements that differ from each other, I would recommend plotting both groups, although how practical that is depends on how large $n$ is. Before applying the distance function to both groups, I further suggest scaling the features, e.g., to the interval $[0, 1]$. This can be done, for instance, through min-max scaling.
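A minimal sketch of the distance computation described above, with made-up numbers. The function names are hypothetical, and scaling both vectors with a shared minimum and maximum is an assumption made here; the answer does not say whether the scaling should be joint or per group.

```python
def min_max_scale(values, lo, hi):
    """Map values linearly so that lo becomes 0 and hi becomes 1."""
    if hi == lo:
        # Degenerate case: all values identical, nothing to spread out.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def manhattan_distance(x, y):
    """M(X, Y) = sum over i of |x_i - y_i| for equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical feature vectors for the two groups (made-up numbers).
x = [10, 40, 30, 20]
y = [5, 25, 45, 25]

# Assumption: scale with the minimum and maximum taken over both vectors,
# so that x and y stay on the same scale.
lo, hi = min(x + y), max(x + y)
x_scaled = min_max_scale(x, lo, hi)
y_scaled = min_max_scale(y, lo, hi)

raw_distance = manhattan_distance(x, y)
scaled_distance = manhattan_distance(x_scaled, y_scaled)
print(raw_distance, scaled_distance)
```

With joint scaling the scaled distance is just the raw distance divided by `hi - lo`, so the ordering of distances between different pairs is preserved. Note that the result is a descriptive measure of dissimilarity, not a significance test.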

Note that instead of the Manhattan distance there are dozens of other metrics that might better fit the specific scenario. I highly recommend Michel and Elena Deza's "Encyclopedia of Distances".