Solved – Dirichlet Processes for clustering: how to deal with labels

bayesian, clustering, dirichlet-process, identifiability, markov-chain-montecarlo

Q: What is the standard way to cluster data using a Dirichlet Process?

When using Gibbs sampling, clusters appear and disappear during the sampling. Besides, we have an identifiability problem, since the posterior distribution is invariant to cluster relabelings. Thus, we cannot say which cluster a user belongs to, but only that two users are in the same cluster (that is, $p(c_i=c_j)$).

Can we summarize the class assignments so that, if $c_i$ is the cluster assignment of point $i$, we know not only that $c_i=c_j$ but that $c_i=c_j=c_k=\dots=c_z$?

These are the alternatives I found and why I think they are incomplete or misguided.

(1) DP-GMM + Gibbs sampling + pairs-based confusion matrix

To use a Dirichlet Process Gaussian Mixture Model (DP-GMM) for clustering, I implemented this paper, where the authors propose a DP-GMM for density estimation using Gibbs sampling.

To explore the clustering performance, they say:

Since the number of components change over the [MCMC] chain, one would need to form a confusion matrix showing the frequency of each data pair being assigned to the same component for the entire chain, see Fig. 6.
[Figure: pairwise co-assignment (confusion) matrix, as in Fig. 6 of the paper]

Cons: This is not a true "complete" clustering but a pairwise one. The figure only looks that clean because we know the real clusters and arrange the rows and columns of the matrix accordingly.
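For what it's worth, this pairwise summary is cheap to compute from the stored assignment vectors. A minimal sketch, where `label_samples` is a hypothetical stash of per-iteration assignment vectors from your own Gibbs sampler:

```python
import numpy as np

def coassignment_matrix(label_samples):
    """Estimate p(c_i == c_j) by averaging the indicator 1[c_i == c_j] over draws.

    label_samples: array of shape (n_draws, n_points); each row is one draw
    of the cluster-assignment vector c from the Gibbs sampler.
    Returns an (n_points, n_points) matrix. It is invariant to label
    switching, because only equality of labels *within* a draw matters.
    """
    label_samples = np.asarray(label_samples)
    n_draws, n_points = label_samples.shape
    coassign = np.zeros((n_points, n_points))
    for c in label_samples:
        coassign += (c[:, None] == c[None, :])
    return coassign / n_draws
```

If a single hard partition is needed from it, one common heuristic is to run hierarchical clustering on $1 - \hat{p}(c_i=c_j)$, but that is a post-hoc summary rather than a posterior quantity.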

(2) DP-GMM + Gibbs sampling + sample until nothing changes

I have been searching and found some people who claim to do clustering based on a Dirichlet Process using a Gibbs sampler. For instance, this post considers the chain to have converged once there are no further changes in either the number of clusters or the means, and takes its summaries from that point.

Cons: I'm not sure this is allowed since, if I'm not wrong:

  • (a) there might be label switching during the MCMC run.

  • (b) even in the stationary distribution, the sampler can spawn new clusters from time to time.
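For concreteness, here is a minimal sketch of the kind of stopping rule option (2) describes, again assuming a hypothetical `label_samples` array of per-iteration assignment vectors; caveats (a) and (b) above are exactly why I distrust it:

```python
import numpy as np

def n_clusters_trace(label_samples):
    """Number of occupied clusters at each Gibbs iteration."""
    return np.array([len(np.unique(c)) for c in label_samples])

def looks_stable(label_samples, window=200):
    """Crude check in the spirit of option (2): has the number of occupied
    clusters stopped changing over the last `window` iterations?

    Note: labels may still switch and the stationary chain can still spawn
    short-lived clusters, so a flat trace is at best a necessary condition,
    not a proof of convergence.
    """
    trace = n_clusters_trace(label_samples)
    return len(np.unique(trace[-window:])) == 1
```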

(3) DP-GMM + Gibbs sampling + choose sample with most likely partition

In this paper, the authors say:

After a “burn-in” period, unbiased samples from the posterior distribution of the IGMM can be drawn from the Gibbs sampler. A hard clustering can be found by drawing many such samples and using the sample with the highest joint likelihood of the class indicator variables. We use a modified IGMM implementation written by M. Mandel.

Cons: Unless this is a collapsed Gibbs sampler, where only the assignments are sampled, we can compute $p(\mathbf{c} \mid \theta)$ but not the marginal $p(\mathbf{c})$. (Would it be better practice instead to take the state with the highest $p(\mathbf{c}, \theta)$?)
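As a sketch of what I have in mind for that last parenthetical: store a joint log-probability alongside each retained draw and keep the best-scoring assignment vector. Here `label_samples` and `joint_log_probs` are hypothetical outputs of your own sampler:

```python
import numpy as np

def map_sample(label_samples, joint_log_probs):
    """Hard clustering from the draw with the highest recorded joint log-probability.

    label_samples: (n_draws, n_points) assignment vectors from the chain.
    joint_log_probs: length-n_draws array of log p(x, c, theta) (or of
    log p(c | theta, x), if only that conditional score is available),
    evaluated at each retained draw.
    """
    best = int(np.argmax(np.asarray(joint_log_probs)))
    return np.asarray(label_samples)[best]
```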

(4) DP-GMM with Variational Inference

I've seen that some libraries use variational inference. I don't know much about variational inference, but I guess you don't have identifiability problems there. However, I would like to stick to MCMC methods (if possible).
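For reference, scikit-learn's `BayesianGaussianMixture` with `weight_concentration_prior_type="dirichlet_process"` fits a truncated variational DP-GMM; a minimal sketch on toy data (the hyperparameters are illustrative, not recommendations):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, size=(100, 2)),
               rng.normal(+3, 1, size=(100, 2))])

# Truncated variational approximation to a DP-GMM: n_components is an
# upper bound; unneeded components end up with negligible weight.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,   # DP concentration parameter alpha
    max_iter=500,
    random_state=0,
).fit(X)

labels = dpgmm.predict(X)             # one hard assignment per point
print(np.round(dpgmm.weights_, 3))    # most components collapse to ~0 weight
```

Because the variational fit returns a single deterministic set of component parameters, `predict` gives one hard labeling, and label switching does not arise within a single fit.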

Any reference would be helpful.

Best Answer

My tentative answer would be to treat $\mathbf{c}$ as a parameter, so that the sample maximizing $p(\mathbf{c},\theta)$ is simply (an approximation to) the posterior mode. This is what I suspect Niekum and Barto did (the paper referenced in option 3). The reason they could afford to be vague about whether they used $p(\mathbf{c}, \theta)$ or $p(\mathbf{c} \mid \theta)$ is that one is proportional to the other.
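To make the proportionality explicit in the notation used above,

$$p(\mathbf{c}, \theta) = p(\mathbf{c} \mid \theta)\, p(\theta),$$

so for any fixed draw of $\theta$ the joint and the conditional differ only by a factor that does not involve $\mathbf{c}$, and maximizing either over $\mathbf{c}$ picks out the same assignment.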

The reason I say this answer is "tentative" is that I'm not sure if designating a value as a "parameter" is just a matter of semantics, or if there's a more technical/theoretical definition that one of the PhD-holding users here would be able to elucidate.
