Solved – Clustering methods for unknown number of clusters

clustering, dirichlet-process, gaussian mixture distribution, gibbs, nonparametric-bayes

Matrix $X=[x_1,\dots,x_i,\dots,x_N]$ is a data set containing $N$ data points, where each data point $x_i$ is a vector of $D$ dimensions. Each dimension is a feature. The number of clusters ($K$) is unknown, and there is no training data, so all of the data points are unlabeled.

Each cluster is assumed to be Gaussian with parameters $[\mathbf{m}, \Sigma]$, where the mean is $\mathbf{m}=[m_1,\dots,m_D]$.
There is no information about the parameters (mean and covariance) of each cluster.
The feature space of a cluster is modeled as a $D$-dimensional multivariate Gaussian, and the overall feature space is a Gaussian mixture model with an unknown number of mixture components ($K$).
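To make this setup concrete, here is a minimal sketch of the assumed generative model in Python. All concrete values (N, D, the true K, the means and covariances) are hypothetical placeholders, not part of the question; rows of `X` are the data points.

```python
import numpy as np

# Sketch of the assumed generative model: N points in D dimensions drawn from a
# Gaussian mixture with K components (K is unknown to the clustering algorithm).
rng = np.random.default_rng(0)
N, D, K = 500, 2, 3                      # hypothetical sample size, dimensions, true K
weights = rng.dirichlet(np.ones(K))      # mixing proportions
means = rng.normal(0, 5, size=(K, D))    # per-cluster means m = [m_1, ..., m_D]
covs = [np.eye(D) for _ in range(K)]     # per-cluster covariances (the "sigma")

z = rng.choice(K, size=N, p=weights)     # latent cluster labels (unobserved)
X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])  # N x D data matrix
```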

I studied a model-based clustering method that has been used for such a problem: nonparametric Bayesian classification (an infinite mixture model). Since the number of mixture components is unknown, it uses a nonparametric prior based on the Dirichlet process (DP), the Chinese restaurant process (CRP) for sampling from a DP, and collapsed Gibbs sampling for the DP mixture model (reference 1).

  1. Which other clustering methods (unsupervised classification) can I try for this problem?

  2. In the DPMM (Dirichlet process mixture model), it is assumed that each mixture component is Gaussian. Can non-Gaussian distributions be used for the mixture components?

  3. In collapsed Gibbs sampling, the number of iterations for the algorithm to converge is assumed fixed. Can the number of iterations be adaptive, depending on the data and the number of components?

I asked question 1 in general terms. I know there are many solutions for one problem, but I am looking for methods that are comparable to the DPMM.
Questions 2 and 3 are specifically about the DPMM.

I have only just studied Gibbs sampling and collapsed Gibbs sampling, and I want to know about other methods.

Reference 1: On Identifying Primary User Emulation Attacks in Cognitive Radio Systems Using Nonparametric Bayesian Classification

Best Answer

  1. Which other clustering methods (unsupervised classification) can I try for this problem?

For instance, parametric ones: you can fit a Gaussian mixture model by Expectation Maximization or Variational Bayes inference, test different numbers of clusters, and select the model that best fits your data. Be careful: model selection is not the same for a non-Bayesian method (such as EM), where you have a point estimate of the clustering, as for a Bayesian method such as VB, where you have a full posterior distribution.
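As a rough sketch of what that could look like with scikit-learn (the candidate range of K, the BIC criterion, and the placeholder data are my assumptions, not part of the answer): `GaussianMixture` is fit by EM for several K and scored by BIC, while `BayesianGaussianMixture` with a Dirichlet-process weight prior is a variational alternative that shrinks unused components toward zero weight.

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))  # placeholder; use your own N x D data

# EM fit for several candidate K; pick the model with the lowest BIC.
candidates = {k: GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, 11)}
best_k = min(candidates, key=lambda k: candidates[k].bic(X))
labels_em = candidates[best_k].predict(X)

# Variational Bayes alternative: give it an upper bound on K and a
# Dirichlet-process prior on the weights; unused components get weight ~ 0.
vb = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
labels_vb = vb.predict(X)
```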

  2. In the DPMM (Dirichlet process mixture model), it is assumed that each mixture component is Gaussian. Can non-Gaussian distributions be used for the mixture components?

The priors of the variables of a simple GMM are: \begin{align*} x_i \mid z_i, \theta_{z_i} &\sim F(\theta_{z_i})\\ \theta_j &\sim G_0\\ z_i &\sim \text{Discrete}(\boldsymbol{\pi})\\ \boldsymbol{\pi} &\sim \text{Dirichlet}(\boldsymbol{\alpha}) \end{align*}

where $F$ is a Normal distribution and $\boldsymbol{\pi}$ has length $k$. The collapsed version comes after integrating out $\boldsymbol{\pi}$. The Dirichlet process / Chinese restaurant process arises naturally when you take the limit $k \rightarrow \infty$.
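To see what the CRP side of that limit looks like in practice, here is a small, hedged sketch (the function name `crp_partition` and the concentration value are mine): it draws a partition of n points from a CRP with concentration alpha, which is exactly a sample from the DP prior over cluster assignments, with no fixed number of clusters.

```python
import numpy as np

def crp_partition(n, alpha, rng=None):
    """Draw cluster assignments for n customers from a Chinese restaurant process
    with concentration alpha (a sample from the DP prior over partitions)."""
    rng = rng or np.random.default_rng()
    assignments = [0]                      # first customer opens the first table
    counts = [1]
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()               # P(existing table) ~ count, P(new table) ~ alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)               # open a new table (a new mixture component)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

# The number of occupied tables grows with n (roughly like alpha * log n).
print(len(set(crp_partition(1000, alpha=1.0))))
```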

You can replace the likelihood $F$ with whatever distribution you like, and you will get a mixture of Bernoullis, a mixture of Poissons, a mixture of...
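As an illustration of swapping the likelihood, here is a minimal sketch of a mixture of Poissons fit by plain EM on 1-D count data. It is deliberately the finite-K, non-Bayesian version (not the collapsed Gibbs DPMM of the paper), and the function name, fixed K, and synthetic data are assumptions of mine; the point is only that the mixture skeleton stays the same while $F$ changes.

```python
import numpy as np

def poisson_mixture_em(x, k, n_iter=100, rng=None):
    """Minimal EM for a 1-D mixture of Poissons: the mixture skeleton is the
    same as for a GMM, only the Gaussian likelihood F is swapped for a Poisson."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    weights = np.full(k, 1.0 / k)
    rates = rng.uniform(0.5 * x.mean(), 1.5 * x.mean(), size=k) + 1e-3  # lambda_j
    for _ in range(n_iter):
        # E-step: responsibilities from the Poisson log-likelihood; the log(x!)
        # term is constant per data point and cancels after normalisation.
        log_resp = np.log(weights) + x[:, None] * np.log(rates) - rates
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and Poisson rates
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        rates = (resp * x[:, None]).sum(axis=0) / (nk + 1e-12)
    return weights, rates, resp.argmax(axis=1)

# Hypothetical usage on synthetic counts from two Poisson components.
rng = np.random.default_rng(1)
x = rng.poisson(lam=np.r_[np.full(300, 2.0), np.full(200, 15.0)])
w, lam, labels = poisson_mixture_em(x, k=2, rng=rng)
```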

3.a. In collapsed Gibbs sampling, the number of iterations for the algorithm to converge is assumed fixed.

Not at all! Why should it be? Gibbs sampling, and in general any MCMC method, needs some burn-in sampling until your Markov chain reaches the stationary distribution (the posterior distribution you are looking for). Once you are there, you keep sampling from your posterior until you are happy. But it is not possible to know a priori how many iterations you need to reach the stationary distribution. Even a posteriori, examining your chains of samples (a.k.a. traces), there are convergence tests, but you never know for sure.

3.b. Can the number of iterations be adaptive, depending on the data and the number of components?

As for adaptive techniques, you can periodically test whether your chains (one for every variable you sample) have converged, and even stop sampling if the test is positive. But it might be a false positive. You may also have to check other things, such as autocorrelation (to decide the thinning).

Usually the more variables you have (and more components also mean more variables), the longer it will take the Gibbs sampler to converge, but there is no mathematical formula for that.
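A hedged sketch of what such an adaptive check could look like (the threshold 1.1, the block-sampling loop, and the `draw_block` helper are hypothetical, not something prescribed by the answer): compute the Gelman-Rubin R-hat across parallel chains to decide whether to keep sampling, and the autocorrelation of a trace to choose the thinning.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for a scalar quantity traced by several chains.
    `chains` has shape (n_chains, n_samples); values near 1 suggest convergence."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    chain_means = chains.mean(axis=1)
    w = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    b = n * chain_means.var(ddof=1)                # between-chain variance
    var_hat = (n - 1) / n * w + b / n              # pooled posterior variance estimate
    return np.sqrt(var_hat / w)

def autocorrelation(x, lag):
    """Sample autocorrelation of one trace at a given lag (guides the thinning)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# Hypothetical adaptive loop: keep drawing blocks of Gibbs iterations per chain
# until R-hat on the post-burn-in samples drops below 1.1.
# while gelman_rubin(np.array(chains)[:, burn_in:]) > 1.1:
#     chains = draw_block(chains)   # hypothetical helper running more iterations
```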
