Non-Distance Based Clustering Algorithms – Are There Any?

clusteringdata miningk-meansmachine learning

It seems that for K-means and other related algorithms, clustering is based off calculating distance between points. Is there one that works without it?

Best Answer

One example of such a method are Finite Mixture Models (e.g. here or here) used for clustering. In FMM you consider the distribution ($f$) of your variable $X$ as a mixture of $K$ distributions ($f_1,...,f_k$):

$$f(x, \vartheta) = \sum^K_{k=1} \pi_k f_k(x, \vartheta_k)$$

where $\vartheta$ is a vector of parameters $\vartheta = (\pi', \vartheta_1', ..., \vartheta_k')'$ and $\pi_k$ is a proportion of $k$'th distribution in the mixture and $\vartheta_k$ is a parameter (or parameters) of $f_k$ distribution.

A specific case for discrete data is Latent Class Analysis (e.g. Vermunt and Magidson, 2003) defined as:

$$P(x, k) = P(k) P(x|k)$$

where $P(k)$ is probability of observing latent class $k$ (i.e. $\pi_k$), $P(x)$ is probability of observing an $x$ value and $P(x|k)$ is probability of $x$ being in class $k$.

Usually for both FMM and LCA EM algorithm is used for estimation, but Bayesian approach is also possible, but a little bit more demanding because of problems such as model identification and label switching (e.g. Xi'an's blog).

So there is no distance measure but rather a statistical model defining the structure (distribution) of your data. Because of that other name of this method is "model-based clustering".

Check the two books on FMM:

One of the most popular clustering packages that uses FMM is mclust (check here or here) that is implemented in R. However, more complicated FMM's are also possible, check for example flexmix package and it's documentation. For LCA there is an R poLCA package.