Solved – Clustering distributions

clustering distributions histogram

I have several distributions (10 distributions in the figure below).

[Figure: distributions]

In fact, these are histograms: there are 70 values on the x-axis, which are the sizes of some particles in a solution, and for each value of x the corresponding value of y is the proportion of particles whose size is around that value of x.

I would like to cluster these distributions. Currently I use hierarchical clustering with, for example, the Euclidean distance. I am not satisfied with this choice of distance. I have tried information-theoretic distances such as Kullback-Leibler, but there are many zeros in the data and this causes difficulties.
Do you have a proposal for an appropriate distance and/or another clustering method?

Best Answer

As I understand it, all the distributions can potentially take on the same 70 discrete values. It is then easy to compare the cumulative curves of the distributions (comparing cumulative curves is the general way to compare distributions). This gives an omnibus comparison of differences in shape, location, and spread.

So, prepare the data in a form like this (A, B, etc. are the distributions):

Value CumProp_A CumProp_B ...
1       .01       .05
2       .12       .14
...     ...       ...
70      1.00      1.00

Then compute a distance matrix between the distributions and submit it to hierarchical clustering (I'd recommend the complete linkage method). Which distance? If you consider two cumulative curves very different when they are far apart at just one value (b), use the Chebyshev distance. If you consider them very different only when one is stably above the other over a wide range of values (c), use the autocorrelative distance. If any local differences between the curves are important (a), use the Manhattan distance.
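The procedure can be sketched in Python roughly as follows. This is only an illustration, assuming NumPy and SciPy are available and that the histograms are stored one per row. The function name `cluster_cumulative_curves`, its parameters, and the toy data are made up for the example and do not come from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_cumulative_curves(hist, metric="chebyshev", n_clusters=2):
    """Cluster histograms by the distance between their cumulative curves.

    hist: array of shape (n_distributions, n_bins), one histogram per row.
    metric: "chebyshev" (case b) or "cityblock" (Manhattan, case a).
    n_clusters: hypothetical choice of how many clusters to cut the tree into.
    """
    hist = np.asarray(hist, dtype=float)
    # Normalise each row to proportions, then accumulate into cumulative curves.
    props = hist / hist.sum(axis=1, keepdims=True)
    cum = np.cumsum(props, axis=1)
    # Condensed pairwise distance matrix between the cumulative curves.
    dists = pdist(cum, metric=metric)
    # Complete-linkage hierarchical clustering on that distance matrix.
    tree = linkage(dists, method="complete")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return cum, dists, labels


# Toy data (made up): two histograms concentrated at small sizes, two at large sizes.
toy = np.array([
    [5, 4, 1, 0, 0],
    [4, 5, 1, 0, 0],
    [0, 0, 1, 4, 5],
    [0, 0, 1, 5, 4],
])
cum, dists, labels = cluster_cumulative_curves(toy, metric="chebyshev", n_clusters=2)
```

Passing `metric="cityblock"` switches to the Manhattan distance. With the real data, `hist` would be the 10 x 70 array of proportions.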

[Figure: pairs of cumulative curves illustrating cases (a), (b), and (c)]

P.S. Autocorrelative distance is just a non-normalized coefficient of autocorrelation of differences between the cumulative curves X and Y:

$\sum_{i=2}^{N} (X-Y)_i \cdot (X-Y)_{i-1}$
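As a small illustration, the sum above could be computed like this, assuming X and Y are cumulative curves sampled at the same N values. The function name and the example curves are invented for the sketch.

```python
import numpy as np


def autocorrelative_distance(x, y):
    """Sum over i = 2..N of (X - Y)_i * (X - Y)_{i-1}."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Multiply each difference by the preceding one and add up the products.
    return float(np.sum(d[1:] * d[:-1]))


# Made-up cumulative curves on five values.
base = np.array([0.1, 0.3, 0.6, 0.9, 1.0])
stably_above = np.array([0.3, 0.5, 0.8, 1.0, 1.0])  # above `base` over a wide range
one_spike = np.array([0.1, 0.3, 0.9, 0.9, 1.0])     # differs from `base` at one value only

wide = autocorrelative_distance(stably_above, base)
local = autocorrelative_distance(one_spike, base)
```

Here `wide` is clearly positive, while `local` is zero because an isolated difference has no non-zero neighbour to multiply with. Note that the sum can be negative when the differences alternate in sign, so it may need to be shifted or clipped before being passed to a routine that expects non-negative distances.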