Solved – Clustering probability distributions – methods & metrics

clustering, distributions, feature selection, k-means, kolmogorov-smirnov test

I have some data points, each containing 5 vectors of agglomerated discrete results, with each vector's results generated by a different distribution (the specific kind of which I am not sure; my best guess is Weibull, with shape parameter varying somewhere between exponential and power law, roughly 1 to 0).

I am looking to use a clustering algorithm like k-means to put each data point into groups based on the attributes of its 5 component distributions. I was wondering if there are any established distance metrics that would be elegant for these purposes. I have had three ideas so far, but I'm not a seasoned statistician (more of a beginning data-mining computer scientist), so I have little idea how far off track I am.

  1. Since I don't know exactly what kind of distributions I'm dealing with, my brute-force approach was to chop each of the distributions (I have 5 per point) into its individual discrete data values (padding each corresponding vector to the same length with zeros at the end) and use each of these values as a separate attribute of the data point itself. I tried using both Manhattan distance and Euclidean distance as metrics based on these attributes, for both the PDF and the CDF. (All three of these constructions are sketched in code after this list.)

  2. Again, since I don't know what kinds of distributions I have, I figured that if I was going to measure the distance between the overall distributions I could use some sort of non-parametric test pairwise between distributions, such as the KS test, to find the likelihood that the given distributions were generated by different PDFs. I thought that my first option (above) using the Manhattan distance would give a sort of upper bound on what I might get using this approach (since the KS statistic is the max absolute value of the difference of the CDFs, whereas the Manhattan distance is the sum of the absolute values of the differences in the PDFs). I then considered combining the different KS statistics or p-values within each data point, probably using Euclidean distance, but possibly just taking the max of all of these values.

  3. Lastly, in an effort to use what little I can interpret about the shape of the distributions, I thought I might try estimating the parameters of each distribution by fitting it to a Weibull curve. I could then cluster the distributions based on differences in the two parameters of the Weibull distribution, lambda and k (scale and shape), probably normalized according to the variance of these parameters or something of the sort. This is the only case where I thought I might have an idea of how to normalize the parameters.
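
To make these three ideas concrete, here is a rough sketch in Python (NumPy/SciPy) of how each distance could be constructed. The helper names (pad_and_normalize, flat_features, ks_statistic, ks_distance, weibull_params), the zero-padding scheme, and the pseudo-sample reconstruction used for the Weibull fit are illustrative assumptions, not an established recipe.

    # Sketch of the three candidate distance constructions. Each object is
    # assumed to be a list of 5 count vectors (one per phase, possibly of
    # different lengths), as described in the edit below.
    import numpy as np
    from scipy.stats import weibull_min

    def pad_and_normalize(counts, n):
        """Idea 1 prep: zero-pad a count vector to length n and turn it into a PMF."""
        v = np.zeros(n)
        v[:len(counts)] = counts
        return v / v.sum()

    def flat_features(obj, n):
        """Idea 1: concatenate the 5 padded PMFs (and CDFs) into one attribute
        vector; Manhattan or Euclidean distance can then be taken on either."""
        pmfs = [pad_and_normalize(c, n) for c in obj]
        cdfs = [np.cumsum(p) for p in pmfs]
        return np.concatenate(pmfs), np.concatenate(cdfs)

    def ks_statistic(p, q):
        """Idea 2: KS statistic between two PMFs = max |difference of their CDFs|."""
        return np.max(np.abs(np.cumsum(p) - np.cumsum(q)))

    def ks_distance(obj_a, obj_b, n):
        """Idea 2: combine the 5 per-phase KS statistics, e.g. Euclidean norm or max."""
        stats = [ks_statistic(pad_and_normalize(a, n), pad_and_normalize(b, n))
                 for a, b in zip(obj_a, obj_b)]
        return np.linalg.norm(stats)  # or max(stats)

    def weibull_params(counts):
        """Idea 3: rebuild pseudo-samples from the integer counts and fit a Weibull
        by maximum likelihood; returns (shape k, scale lambda)."""
        samples = np.repeat(np.arange(1, len(counts) + 1), counts)
        k, _, lam = weibull_min.fit(samples, floc=0)
        return k, lam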

So my question is: what measures/methods would you recommend for clustering distributions? Am I even on the right track with any of these? Is k-means even a good algorithm to use?

Edit: Clarification of data.

Each data point (each object Obj that I want to cluster) literally contains 5 vectors of data. I know there are exactly 5 phases that these objects can be in. We'll say (for the purposes of simplification) that each vector is of length N.

Each one of these vectors (call it vector i) is a probability distribution with integer x-values of 1 through N, where each corresponding y-value represents the probability of measuring value x in phase i of the object Obj. N is then the maximum x-value I expect to measure in any phase of the object (this is not actually a fixed number in my analysis).

I determine these probabilities in the following manner:

  1. I take a single Obj and put it in phase i for k trials, taking a measurement at each trial. Each measurement is a single whole number. I do this for each of 5 phases of a single object, and in turn for each object. My raw measurement data for a single object might look like:

    Vector 1. [90, 42, 30, 9, 3, 4, 0, 1, 0, 0, 1]

    Vector 2. [150, 16, 5, 0, 1, 0, 0, 0, 0, 0, 0]

    Vector 5. [16, … …, 0]

  2. Then I normalize each of the vectors on its own, with respect to the total number of measurements in that given vector. This gives me a probability distribution in that vector, where each corresponding y-value represents the probability of measuring value x in phase i.
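
For concreteness, the normalization in step 2 is just a division of each count vector by its total; a minimal sketch using the Vector 1 counts from above:

    import numpy as np

    # Raw counts for phase 1 of one object (Vector 1 above).
    vector_1 = np.array([90, 42, 30, 9, 3, 4, 0, 1, 0, 0, 1], dtype=float)

    pmf_1 = vector_1 / vector_1.sum()  # P(measuring x) for x = 1..11 in phase 1
    cdf_1 = np.cumsum(pmf_1)           # cumulative version, used by the CDF-based metrics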

Best Answer

(Computational) Information Geometry is a field that deals exactly with these kinds of problems. K-means has an extension called Bregman k-means, which uses divergences (the squared Euclidean distance of standard k-means is a particular case, but so is the Kullback-Leibler divergence). A given divergence is associated with a distribution, e.g. squared Euclidean with the Gaussian.
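
As a concrete illustration of the Kullback-Leibler case, here is a minimal toy sketch of Bregman k-means, assuming each object has already been flattened into a single PMF (for example, the 5 padded per-phase PMFs from the question, concatenated and renormalized). It is only meant to show the structure of the algorithm: for any Bregman divergence the centroid update remains an ordinary arithmetic mean, and only the assignment step changes.

    import numpy as np

    def kl(p, q, eps=1e-12):
        """KL(p || q) for discrete distributions, with clipping to avoid log(0)."""
        p = np.clip(p, eps, None)
        q = np.clip(q, eps, None)
        return np.sum(p * np.log(p / q))

    def bregman_kmeans(X, k, n_iter=100, seed=None):
        """X: (n_points, d) array whose rows are PMFs. Returns (labels, centroids)."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iter):
            # Assignment step: nearest centroid under KL(point || centroid).
            new_labels = np.array([np.argmin([kl(x, c) for c in centroids]) for x in X])
            # Update step: arithmetic mean of the member PMFs (still a valid PMF).
            for j in range(k):
                members = X[new_labels == j]
                if len(members):
                    centroids[j] = members.mean(axis=0)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return labels, centroids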

You can also have a look at the work of Frank Nielsen, for example.

You can also have a look at Wasserstein distances (optimal transport), mentioned as the Earth Mover's Distance in a previous post.
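
For the 1-D, integer-supported distributions described in the question, this distance is straightforward to compute. The sketch below (the helper names phase_emd and object_distance are illustrative, and both PMFs are assumed to be padded to a common length N) builds a per-object distance from the 5 per-phase Wasserstein distances; a pairwise distance matrix built this way could then be fed to hierarchical clustering or k-medoids, since plain k-means needs explicit centroids.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def phase_emd(p, q):
        """1-D Wasserstein (Earth Mover's) distance between two PMFs on 1..N."""
        support = np.arange(1, len(p) + 1)
        return wasserstein_distance(support, support, u_weights=p, v_weights=q)

    def object_distance(obj_a, obj_b):
        """Combine the 5 per-phase distances, e.g. with a Euclidean norm."""
        return np.linalg.norm([phase_emd(p, q) for p, q in zip(obj_a, obj_b)])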