Solved – Clustering 1D data

clustering

I have a dataset and want to cluster it based on only one variable (there are no missing values). I want to create 3 clusters based on that one variable.

Which clustering algorithm should I use: k-means, EM, DBSCAN, etc.?

My main question is, in what circumstances should I use k-means over EM or EM over k-means?

Best Answer

The K-means algorithm and the EM algorithm are going to be pretty similar for 1D clustering.

In K-means you start with a guess of where the means are and assign each point to the cluster with the closest mean. Then you recompute the means based on the current assignment of points, then update the assignment of points, then update the means, and so on until the assignments stop changing. (K-means itself only tracks the means; variances are not part of the update.)
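The loop described above can be sketched in a few lines of plain Python. This is only an illustration, not a tuned implementation: the function name `kmeans_1d`, the toy data, and the starting means are all made up for the example.

```python
def kmeans_1d(xs, means, max_iter=100):
    """K-means on a list of numbers; `means` is the initial guess (hypothetical helper)."""
    means = list(means)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest mean.
        new_labels = [
            min(range(len(means)), key=lambda k: abs(x - means[k])) for x in xs
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the means would not move either
        labels = new_labels
        # Update step: each mean becomes the average of the points assigned to it.
        for k in range(len(means)):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            if members:  # keep the old mean if a cluster ends up empty
                means[k] = sum(members) / len(members)
    return means, labels


# Made-up data with three well-separated groups and a rough initial guess.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.8, 10.1, 10.4]
km_means, km_labels = kmeans_1d(data, means=[0.0, 5.0, 12.0])
print(km_means, km_labels)
```

Each point ends up with exactly one label, which is the "all or nothing" assignment that distinguishes K-means from EM.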

In EM (here, fitting a mixture of distributions, typically Gaussians) you would also start with a guess of where the means are. Then you compute the expected value of the assignments (essentially the probability of each point being in each cluster), then you update the estimated means (and variances) using the expected values as weights, then compute new expected values, then compute new means, and so on.
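A minimal sketch of those two alternating steps, assuming each cluster is modeled as a Gaussian. The function names, the starting variances of 1.0, the equal starting proportions, and the toy data are assumptions made for the example rather than anything prescribed by the method.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_1d(xs, means, n_iter=50, min_var=1e-6):
    """EM for a 1D Gaussian mixture; `means` is the initial guess (hypothetical helper)."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k    # assumed starting variances
    weights = [1.0 / k] * k  # assumed equal starting mixing proportions
    resp = []
    for _ in range(n_iter):
        # E-step: probability of each point belonging to each cluster.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens) or 1e-300  # guard against division by zero on underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters, using the probabilities as weights.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # leave an empty component unchanged
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)  # floor keeps a component from collapsing
            weights[j] = nj / len(xs)
    # `resp` holds the membership probabilities from the last E-step.
    return means, variances, weights, resp


data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.8, 10.1, 10.4]
em_means, em_vars, em_weights, em_resp = em_1d(data, means=[0.0, 5.0, 12.0])
print(em_means, em_vars)
```

Each row of `em_resp` sums to 1 and gives the soft membership of one point. Taking the largest entry per row turns it into a hard assignment if one is needed.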

The primary difference is that the assignment of points to clusters in K-means is all or nothing, whereas EM gives probabilities of group membership (one point may be seen as having 80% probability of being in group A, 18% probability of being in group B, and 2% probability of being in group C). If there is a lot of separation between the groups, then the two methods are going to give pretty similar results. But if there is a fair amount of overlap, then EM will probably give more meaningful results (even more so if the variance/standard deviation is of interest). If all you care about is assigning group membership without caring about the parameters, then K-means is probably simpler.

Why not do both and see how different the answers are? If they are similar, go with the simpler one; if they are different, decide by comparing the groupings to the data and to outside knowledge.
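One rough way to quantify "how different" is sketched below. Cluster numbers are arbitrary, so the two labelings are first renumbered by increasing cluster mean (which is well defined in 1D), and then the fraction of points that land in the same group is computed. The helper names and both label lists are made up for illustration; in practice the labels would come from your own K-means and EM runs.

```python
def relabel_by_mean(xs, labels):
    """Renumber clusters 0, 1, 2, ... in order of increasing cluster mean."""
    groups = {}
    for x, lab in zip(xs, labels):
        groups.setdefault(lab, []).append(x)
    order = sorted(groups, key=lambda lab: sum(groups[lab]) / len(groups[lab]))
    rank = {lab: i for i, lab in enumerate(order)}
    return [rank[lab] for lab in labels]


def compare_labelings(xs, labels_a, labels_b):
    """Return the fraction of matching assignments and the points that differ."""
    a = relabel_by_mean(xs, labels_a)
    b = relabel_by_mean(xs, labels_b)
    differing = [x for x, p, q in zip(xs, a, b) if p != q]
    return 1 - len(differing) / len(xs), differing


# Made-up example: the two methods disagree only about the borderline point 3.0.
data = [1.0, 1.2, 0.8, 3.0, 5.0, 5.3, 4.9, 9.8, 10.1, 10.4]
kmeans_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]  # hypothetical K-means output
em_labels = [1, 1, 1, 0, 0, 0, 0, 2, 2, 2]      # hypothetical EM output (most probable group)

agreement, differing = compare_labelings(data, kmeans_labels, em_labels)
print(agreement, differing)
```

The list of differing points is the useful part: those are the observations to look at against the data and outside knowledge when the two methods disagree.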