Solved – Clustering methods for unknown number of clusters

clustering, dirichlet-process, gaussian mixture distribution, gibbs, nonparametric-bayes

Matrix $X=[x_1,\dots,x_i,\dots,x_N]$ is a data set containing $N$ data points, where each data point $x_i$ is a vector of $D$ dimensions. Each dimension is a feature. The number of clusters ($K$) is unknown, and there is no training data, so all of the data points are unlabeled.

Each cluster is assumed to be Gaussian with parameters $[\mathbf{m}, \Sigma]$, where the mean is $\mathbf{m}=[m_1,\dots,m_D]$.
There is no information about the parameters (mean and covariance) of each cluster.
The feature space of a cluster is modeled as a $D$-dimensional multivariate Gaussian, and the overall feature space is a Gaussian mixture model with an unknown number of mixture components ($K$).
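To make this setup concrete, here is a minimal sketch of the assumed generative model in Python. All concrete values (N, D, the true K, the means and covariances) are hypothetical placeholders, not part of the question; rows of `X` are the data points.

```python
import numpy as np

# Sketch of the assumed generative model: N points in D dimensions drawn from a
# Gaussian mixture with K components (K is unknown to the clustering algorithm).
rng = np.random.default_rng(0)
N, D, K = 500, 2, 3                      # hypothetical sample size, dimensions, true K
weights = rng.dirichlet(np.ones(K))      # mixing proportions
means = rng.normal(0, 5, size=(K, D))    # per-cluster means m = [m_1, ..., m_D]
covs = [np.eye(D) for _ in range(K)]     # per-cluster covariances (the "sigma")

z = rng.choice(K, size=N, p=weights)     # latent cluster labels (unobserved)
X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])  # N x D data matrix
```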

I studied a model-based clustering method that has been used for such a problem: nonparametric Bayesian classification (an infinite mixture model). Since the number of mixture components is unknown, it uses a nonparametric prior based on the Dirichlet process (DP), the Chinese restaurant process (CRP) for sampling from a DP, and collapsed Gibbs sampling for the DP mixture model (reference 1).

  1. Which other clustering methods (unsupervised classification) can I try for this problem?

  2. In the DPMM (Dirichlet process mixture model), it is assumed that each mixture component is Gaussian. Can non-Gaussian distributions be used for the mixture components?

  3. In collapsed Gibbs sampling, the number of iterations for the algorithm to converge is assumed fixed. Can the number of iterations be adaptive, depending on the data and the number of components?

I asked question 1 in general terms. I know there are many solutions for one problem, but I am looking for methods that are comparable to the DPMM.
Questions 2 and 3 are specifically about the DPMM.

I have only just studied Gibbs sampling and collapsed Gibbs sampling, and I want to know about other methods.

Reference 1: On Identifying Primary User Emulation Attacks in Cognitive Radio Systems Using Nonparametric Bayesian Classification

Best Answer

  1. Which other clustering methods (unsupervised classification) can I try for this problem?

For instance, parametric ones: you can fit a Gaussian mixture model by Expectation Maximization or Variational Bayes inference, test different numbers of clusters, and select the model that best fits your data. Be careful: model selection is not the same for a non-Bayesian method (such as EM), where you have a point estimate of the clustering, as for a Bayesian method such as VB, where you have a full posterior distribution.
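As a rough sketch of what that could look like with scikit-learn (the candidate range of K, the BIC criterion, and the placeholder data are my assumptions, not part of the answer): `GaussianMixture` is fit by EM for several K and scored by BIC, while `BayesianGaussianMixture` with a Dirichlet-process weight prior is a variational alternative that shrinks unused components toward zero weight.

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))  # placeholder; use your own N x D data

# EM fit for several candidate K; pick the model with the lowest BIC.
candidates = {k: GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, 11)}
best_k = min(candidates, key=lambda k: candidates[k].bic(X))
labels_em = candidates[best_k].predict(X)

# Variational Bayes alternative: give it an upper bound on K and a
# Dirichlet-process prior on the weights; unused components get weight ~ 0.
vb = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
labels_vb = vb.predict(X)
```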

  2. In the DPMM (Dirichlet process mixture model), it is assumed that each mixture component is Gaussian. Can non-Gaussian distributions be used for the mixture components?

The priors of the variables of a simple GMM are: \begin{align*} x_i \mid z_i, \theta_{z_i} &\sim F(\theta_{z_i})\\ \theta_j &\sim G_0\\ z_i &\sim \text{Discrete}(\boldsymbol{\pi})\\ \boldsymbol{\pi} &\sim \text{Dirichlet}(\boldsymbol{\alpha}) \end{align*}

where $F$ is a Normal distribution and $\boldsymbol{\pi}$ has length $k$. The collapsed version comes after integrating out $\boldsymbol{\pi}$. The Dirichlet process / Chinese restaurant process arises naturally when you take the limit $k \rightarrow \infty$.
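To see what the CRP side of that limit looks like in practice, here is a small, hedged sketch (the function name `crp_partition` and the concentration value are mine): it draws a partition of n points from a CRP with concentration alpha, which is exactly a sample from the DP prior over cluster assignments, with no fixed number of clusters.

```python
import numpy as np

def crp_partition(n, alpha, rng=None):
    """Draw cluster assignments for n customers from a Chinese restaurant process
    with concentration alpha (a sample from the DP prior over partitions)."""
    rng = rng or np.random.default_rng()
    assignments = [0]                      # first customer opens the first table
    counts = [1]
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()               # P(existing table) ~ count, P(new table) ~ alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)               # open a new table (a new mixture component)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

# The number of occupied tables grows with n (roughly like alpha * log n).
print(len(set(crp_partition(1000, alpha=1.0))))
```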

You can replace the likelihood $F$ with whatever distribution you like, and you will get a mixture of Bernoullis, a mixture of Poissons, a mixture of...
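As an illustration of swapping the likelihood, here is a minimal sketch of a mixture of Poissons fit by plain EM on 1-D count data. It is deliberately the finite-K, non-Bayesian version (not the collapsed Gibbs DPMM of the paper), and the function name, fixed K, and synthetic data are assumptions of mine; the point is only that the mixture skeleton stays the same while $F$ changes.

```python
import numpy as np

def poisson_mixture_em(x, k, n_iter=100, rng=None):
    """Minimal EM for a 1-D mixture of Poissons: the mixture skeleton is the
    same as for a GMM, only the Gaussian likelihood F is swapped for a Poisson."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    weights = np.full(k, 1.0 / k)
    rates = rng.uniform(0.5 * x.mean(), 1.5 * x.mean(), size=k) + 1e-3  # lambda_j
    for _ in range(n_iter):
        # E-step: responsibilities from the Poisson log-likelihood; the log(x!)
        # term is constant per data point and cancels after normalisation.
        log_resp = np.log(weights) + x[:, None] * np.log(rates) - rates
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and Poisson rates
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        rates = (resp * x[:, None]).sum(axis=0) / (nk + 1e-12)
    return weights, rates, resp.argmax(axis=1)

# Hypothetical usage on synthetic counts from two Poisson components.
rng = np.random.default_rng(1)
x = rng.poisson(lam=np.r_[np.full(300, 2.0), np.full(200, 15.0)])
w, lam, labels = poisson_mixture_em(x, k=2, rng=rng)
```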

3.a. In collapsed Gibbs sampling, the number of iterations for the algorithm to converge is assumed fixed.

Not at all! Why should it be? Gibbs sampling, and in general any MCMC method, needs some burn-in sampling until your Markov chain reaches the stationary distribution (the posterior distribution you are looking for). Once you are there, you keep sampling from your posterior until you are happy. But it is not possible to know a priori how many iterations you need to reach the stationary distribution. Even a posteriori, examining your chains of samples (a.k.a. traces), there are convergence tests, but you never know for sure.

3.b. Can the number of iterations be adaptive, depending on the data and the number of components?

As for adaptive techniques, you can periodically test whether your chains (one for every variable you sample) have converged, and even stop sampling if the test is positive. But it might be a false positive. You may also have to check other things, such as autocorrelation (to decide the thinning).

Usually the more variables you have (and more components also mean more variables), the longer it will take the Gibbs sampler to converge, but there is no mathematical formula for that.
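A hedged sketch of what such an adaptive check could look like (the threshold 1.1, the block-sampling loop, and the `draw_block` helper are hypothetical, not something prescribed by the answer): compute the Gelman-Rubin R-hat across parallel chains to decide whether to keep sampling, and the autocorrelation of a trace to choose the thinning.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for a scalar quantity traced by several chains.
    `chains` has shape (n_chains, n_samples); values near 1 suggest convergence."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    chain_means = chains.mean(axis=1)
    w = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    b = n * chain_means.var(ddof=1)                # between-chain variance
    var_hat = (n - 1) / n * w + b / n              # pooled posterior variance estimate
    return np.sqrt(var_hat / w)

def autocorrelation(x, lag):
    """Sample autocorrelation of one trace at a given lag (guides the thinning)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# Hypothetical adaptive loop: keep drawing blocks of Gibbs iterations per chain
# until R-hat on the post-burn-in samples drops below 1.1.
# while gelman_rubin(np.array(chains)[:, burn_in:]) > 1.1:
#     chains = draw_block(chains)   # hypothetical helper running more iterations
```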
