Solved – Dirichlet Processes for clustering: how to deal with labels

bayesian, clustering, dirichlet-process, identifiability, markov-chain-montecarlo

Q: What is the standard way to cluster data using a Dirichlet Process?

When using Gibbs sampling, clusters appear and disappear during the sampling. Besides, we have an identifiability problem, since the posterior distribution is invariant to cluster relabelings. Thus, we cannot say which cluster a user belongs to, but only that two users are in the same cluster (that is, $p(c_i=c_j)$).

Can we summarize the class assignments so that, if $c_i$ is the cluster assignment of point $i$, we know not only that $c_i=c_j$ but that $c_i=c_j=c_k=\dots=c_z$?

These are the alternatives I found and why I think they are incomplete or misguided.

(1) DP-GMM + Gibbs sampling + pairs-based confusion matrix

To use a Dirichlet Process Gaussian Mixture Model (DP-GMM) for clustering, I implemented this paper, where the authors propose a DP-GMM for density estimation using Gibbs sampling.

To explore the clustering performance, they say:

Since the number of components change over the [MCMC] chain, one would need to form a confusion matrix showing the frequency of each data pair being assigned to the same component for the entire chain, see Fig. 6.
[Figure: pairwise co-assignment (confusion) matrix, as in Fig. 6 of the paper]

Cons: This is not a true "complete" clustering but a pairwise one. The figure only looks that clean because we know the real clusters and arrange the rows and columns of the matrix accordingly.
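For what it's worth, this pairwise summary is cheap to compute from the stored assignment vectors. A minimal sketch, where `label_samples` is a hypothetical stash of per-iteration assignment vectors from your own Gibbs sampler:

```python
import numpy as np

def coassignment_matrix(label_samples):
    """Estimate p(c_i == c_j) by averaging the indicator 1[c_i == c_j] over draws.

    label_samples: array of shape (n_draws, n_points); each row is one draw
    of the cluster-assignment vector c from the Gibbs sampler.
    Returns an (n_points, n_points) matrix. It is invariant to label
    switching, because only equality of labels *within* a draw matters.
    """
    label_samples = np.asarray(label_samples)
    n_draws, n_points = label_samples.shape
    coassign = np.zeros((n_points, n_points))
    for c in label_samples:
        coassign += (c[:, None] == c[None, :])
    return coassign / n_draws
```

If a single hard partition is needed from it, one common heuristic is to run hierarchical clustering on $1 - \hat{p}(c_i=c_j)$, but that is a post-hoc summary rather than a posterior quantity.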

(2) DP-GMM + Gibbs sampling + sample until nothing changes

I have been searching and found some people who claim to do clustering based on a Dirichlet Process using a Gibbs sampler. For instance, this post considers the chain to have converged once there are no further changes in either the number of clusters or the means, and takes its summaries from that point.

Cons: I'm not sure this is allowed since, if I'm not wrong:

  • (a) there might be label switching during the MCMC run.

  • (b) even in the stationary distribution, the sampler can spawn new clusters from time to time.
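For concreteness, here is a minimal sketch of the kind of stopping rule option (2) describes, again assuming a hypothetical `label_samples` array of per-iteration assignment vectors; caveats (a) and (b) above are exactly why I distrust it:

```python
import numpy as np

def n_clusters_trace(label_samples):
    """Number of occupied clusters at each Gibbs iteration."""
    return np.array([len(np.unique(c)) for c in label_samples])

def looks_stable(label_samples, window=200):
    """Crude check in the spirit of option (2): has the number of occupied
    clusters stopped changing over the last `window` iterations?

    Note: labels may still switch and the stationary chain can still spawn
    short-lived clusters, so a flat trace is at best a necessary condition,
    not a proof of convergence.
    """
    trace = n_clusters_trace(label_samples)
    return len(np.unique(trace[-window:])) == 1
```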

(3) DP-GMM + Gibbs sampling + choose sample with most likely partition

In this paper, the authors say:

After a “burn-in” period, unbiased samples from the posterior distribution of the IGMM can be drawn from the Gibbs sampler. A hard clustering can be found by drawing many such samples and using the sample with the highest joint likelihood of the class indicator variables. We use a modified IGMM implementation written by M. Mandel.

Cons: Unless this is a collapsed Gibbs sampler, where only the assignments are sampled, we can compute $p(\mathbf{c} \mid \theta)$ but not the marginal $p(\mathbf{c})$. (Would it be better practice instead to take the state with the highest $p(\mathbf{c}, \theta)$?)
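As a sketch of what I have in mind for that last parenthetical: store a joint log-probability alongside each retained draw and keep the best-scoring assignment vector. Here `label_samples` and `joint_log_probs` are hypothetical outputs of your own sampler:

```python
import numpy as np

def map_sample(label_samples, joint_log_probs):
    """Hard clustering from the draw with the highest recorded joint log-probability.

    label_samples: (n_draws, n_points) assignment vectors from the chain.
    joint_log_probs: length-n_draws array of log p(x, c, theta) (or of
    log p(c | theta, x), if only that conditional score is available),
    evaluated at each retained draw.
    """
    best = int(np.argmax(np.asarray(joint_log_probs)))
    return np.asarray(label_samples)[best]
```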

(4) DP-GMM with Variational Inference

I've seen that some libraries use variational inference. I don't know much about variational inference, but I guess you don't have identifiability problems there. However, I would like to stick to MCMC methods (if possible).
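For reference, scikit-learn's `BayesianGaussianMixture` with `weight_concentration_prior_type="dirichlet_process"` fits a truncated variational DP-GMM; a minimal sketch on toy data (the hyperparameters are illustrative, not recommendations):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, size=(100, 2)),
               rng.normal(+3, 1, size=(100, 2))])

# Truncated variational approximation to a DP-GMM: n_components is an
# upper bound; unneeded components end up with negligible weight.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,   # DP concentration parameter alpha
    max_iter=500,
    random_state=0,
).fit(X)

labels = dpgmm.predict(X)             # one hard assignment per point
print(np.round(dpgmm.weights_, 3))    # most components collapse to ~0 weight
```

Because the variational fit returns a single deterministic set of component parameters, `predict` gives one hard labeling, and label switching does not arise within a single fit.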

Any reference would be helpful.

Best Answer

My tentative answer would be to treat $\mathbf{c}$ as a parameter, so that the sample maximizing $p(\mathbf{c},\theta)$ is simply (an approximation to) the posterior mode. This is what I suspect Niekum and Barto did (the paper referenced in option 3). The reason they could afford to be vague about whether they used $p(\mathbf{c}, \theta)$ or $p(\mathbf{c} \mid \theta)$ is that one is proportional to the other.
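To make the proportionality explicit in the notation used above,

$$p(\mathbf{c}, \theta) = p(\mathbf{c} \mid \theta)\, p(\theta),$$

so for any fixed draw of $\theta$ the joint and the conditional differ only by a factor that does not involve $\mathbf{c}$, and maximizing either over $\mathbf{c}$ picks out the same assignment.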

The reason I say this answer is "tentative" is that I'm not sure if designating a value as a "parameter" is just a matter of semantics, or if there's a more technical/theoretical definition that one of the PhD-holding users here would be able to elucidate.
