Solved – Clustering of distributions in R

clustering, distributions, hierarchical-clustering, model-based-clustering, r

I have a set of distributions corresponding to predictions for how each of hundreds of players will perform. I am looking to identify the distinct distributions among these players; in other words, the distinct distributions within a group of distributions.

I know Mclust() can perform clustering on a vector, e.g.:

library("mclust")

mydata <- c(1,1,2,2,3,3,5,7,8,9,10)

summary(Mclust(mydata), parameters=TRUE)
Mclust(mydata)$classification

However, my data are a series of vectors (i.e., distributions)—one vector for each player, e.g.:

set.seed(12345)
playerA <- rnorm(10, mean=1, sd=.1)
playerB <- rnorm(100, mean=1, sd=1)
playerC <- rnorm(10, mean=2, sd=1)
playerD <- rnorm(5, mean=2, sd=2)
playerE <- rnorm(2, mean=3, sd=1)
playerF <- rnorm(20, mean=5, sd=1)
playerG <- rnorm(100, mean=7, sd=.5)
playerH <- rnorm(10, mean=8, sd=2)
playerI <- rnorm(5, mean=9, sd=1)
playerJ <- rnorm(10, mean=10, sd=.5)

How can I perform clustering to identify the distinct clusters of players based on their distributions, focusing on differences in their means rather than their variances? I don't want to cluster the mean values alone, though, because I want to take the variances into account when deciding whether two means belong to the same cluster or to different ones (e.g., high variability in two players' distributions may indicate that two players with different means are in the same cluster). Ideally, two players with the same mean but different variability would end up in the same cluster. Is there a way to do this using mclust or another package in R? I've considered pairwise t-tests, but the results would depend heavily on the sample size of each distribution, and I'd rather the method not be too dependent on sample size, if possible. I've also considered comparisons based on effect size (Cohen's d). I'm not sure what other options there are (e.g., Tukey's HSD, hierarchical clustering, etc.).

Best Answer

A simple approach is to cluster the samples on their means and then check how the variances are distributed across the members of each cluster. Which method is appropriate depends on whether you want to cluster on the averages alone or on the entire distribution; these are two different questions. For the latter, you could use something like the Bhattacharyya coefficient or the Kullback-Leibler divergence as the (dis)similarity measure. Bear in mind that the two approaches will give you different clusters. I think you can also specify different assumptions for the underlying variances in the mclust package.
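As a rough illustration of the whole-distribution route, here is a minimal base-R sketch that computes a pairwise Bhattacharyya distance between the players from the question and feeds it to hierarchical clustering. Several things here are my assumptions rather than part of the question or answer: each player's predictions are treated as roughly normal (so the closed-form distance for two univariate normals applies), the helper name bhattacharyya_dist and its mean_only argument are made up for this example, and the average linkage and k = 3 are arbitrary choices. Setting mean_only = TRUE keeps only the mean-separation term, scaled by the pooled variance, which seems closer to what the question asks for (same mean, different spread should not be pushed apart).

```r
# Example data from the question: one vector of predictions per player.
set.seed(12345)
players <- list(
  A = rnorm(10,  mean = 1,  sd = 0.1),
  B = rnorm(100, mean = 1,  sd = 1),
  C = rnorm(10,  mean = 2,  sd = 1),
  D = rnorm(5,   mean = 2,  sd = 2),
  E = rnorm(2,   mean = 3,  sd = 1),
  F = rnorm(20,  mean = 5,  sd = 1),
  G = rnorm(100, mean = 7,  sd = 0.5),
  H = rnorm(10,  mean = 8,  sd = 2),
  I = rnorm(5,   mean = 9,  sd = 1),
  J = rnorm(10,  mean = 10, sd = 0.5)
)

# Hypothetical helper: Bhattacharyya distance between two univariate normals
# fitted to the samples (ASSUMPTION: each sample is roughly Gaussian).
# mean_only = TRUE drops the term that penalises unequal variances.
bhattacharyya_dist <- function(x, y, mean_only = FALSE) {
  m1 <- mean(x); m2 <- mean(y)
  v1 <- var(x);  v2 <- var(y)
  mean_term <- 0.25 * (m1 - m2)^2 / (v1 + v2)
  var_term  <- 0.5 * log((v1 + v2) / (2 * sqrt(v1 * v2)))
  if (mean_only) mean_term else mean_term + var_term
}

# Symmetric matrix of pairwise distances between players.
n <- length(players)
d <- matrix(0, n, n, dimnames = list(names(players), names(players)))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    d[i, j] <- d[j, i] <- bhattacharyya_dist(players[[i]], players[[j]])
  }
}

# Hierarchical clustering on the distances; linkage and k are arbitrary here.
hc <- hclust(as.dist(d), method = "average")
clusters <- cutree(hc, k = 3)
print(clusters)
```

plot(hc) shows the dendrogram, which is probably a better guide to the number of clusters than a fixed k. Note that variance estimates from very small samples (player E has only two observations) are noisy, so distances involving those players should be read with caution.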
