The grid mapping you describe is a form of index: with finite precision, there are a fixed number of values that the mean of each component can have, so you can reduce any mean vector to a scalar pointer to the index. You can do the same thing with the covariance matrices: if the values have a finite precision, then only a finite number of different matrices are possible; these can be indexed, then the $D(D+1)/2$ parameters could be replaced with a single index. In fact, with finite precision, there are only a finite number of mixture models possible, so in principle, these could be indexed, reducing all the parameters to a single scalar.
However, the information contained in such an index must logically be the same as in the full set of parameters, so you're not really reducing the complexity of the model. If you're comparing two models using some information criterion (AIC, BIC, MDL, etc.), then reducing the precision of all the models will not change the ranking of their relative goodness-of-fit.
Nice idea though!
- Which other clustering methods (unsupervised classification) can I try for this problem?
For instance, parametric ones: you can fit a Gaussian Mixture Model by Expectation Maximization or by Variational Bayes inference; you test different numbers of clusters and select the model that best fits your data. Be careful: model selection is not the same for a non-Bayesian method (such as EM), where you have a point estimate of the clustering, as for a Bayesian method such as VB, where you have a full posterior distribution.
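As a minimal sketch of the EM + model-selection route (scikit-learn and the toy data are my own choices here, not part of the original answer), you could compare BICs over a range of component counts:

```python
# Sketch: fit GMMs by EM for several numbers of components and pick the
# lowest BIC. scikit-learn and the toy data are assumptions, not part of
# the original answer.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(1.0, 0.7, 200)]).reshape(-1, 1)  # toy 1-D data

models = [GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 7)]
bics = [m.bic(X) for m in models]
best = models[int(np.argmin(bics))]
print("BIC per k:", np.round(bics, 1))
print("selected number of components:", best.n_components)
```

scikit-learn's `BayesianGaussianMixture` is the variational Bayes counterpart; instead of comparing BICs across fits, it can shrink the weights of unnecessary components.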
- In DPMM (Dirichlet process mixture model), it is assumed that each mixture component is Gaussian. Can non-Gaussian distributions be used for mixture components?
The priors of the variables of a simple GMM are:
\begin{align*}
x_i | z_i, \theta_{z_i} &\sim F(\theta_{z_i})\\
\theta_j &\sim G_0\\
z_i &\sim \text{Discrete}(\boldsymbol{\pi})\\
\boldsymbol{\pi} &\sim \text{Dirichlet}(\boldsymbol{\alpha})
\end{align*}
where $F$ is a Normal distribution and $\boldsymbol{\pi}$ has length $k$. The collapsed version comes from integrating out $\boldsymbol{\pi}$. The Dirichlet Process / Chinese Restaurant Process arises naturally when you take the limit $k \rightarrow \infty$.
You can replace the likelihood $F$ by whatever function you like and you will get a mixture of Bernoullis, a mixture of Poissons, a mixture of...
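To make that concrete, here is a small sketch (NumPy only; the choice of $G_0$, its hyperparameters, and all values are illustrative) that samples from the finite mixture above with a Poisson likelihood $F$ and a Gamma base measure $G_0$:

```python
# Sketch of the generative model above with a non-Gaussian likelihood:
# F = Poisson, G0 = Gamma prior on the rates. All names and hyperparameter
# values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
k, n, alpha = 3, 500, 1.0

pi = rng.dirichlet(np.full(k, alpha))            # pi ~ Dirichlet(alpha)
theta = rng.gamma(shape=2.0, scale=5.0, size=k)  # theta_j ~ G0 (here Gamma)
z = rng.choice(k, size=n, p=pi)                  # z_i ~ Discrete(pi)
x = rng.poisson(theta[z])                        # x_i | z_i ~ F(theta_{z_i})
print("component rates:", np.round(theta, 2))
print("first 10 draws:", x[:10])
```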
3.a. In collapsed Gibbs sampling, the number of iterations for the algorithm convergence is assumed fixed.
Not at all! Why should it be? Gibbs sampling, and in general any MCMC method, needs some burn-in sampling until your Markov chain arrives at the stationary distribution (the posterior distribution you are looking for). Once you are there, you keep sampling from your posterior until you are happy. But it is not possible to know a priori how many iterations you need to reach the stationary distribution. Actually, even a posteriori, examining your chains of samples (a.k.a. traces) with convergence tests, you never know for sure.
3.b. Is it possible that the number of iterations is adaptive, depending on the data and the number of components?
So, as for adaptive techniques, you can periodically test whether your chain(s) (one for every variable you sample) have converged, and even stop sampling if the test comes back positive. But that might be a false positive. And you may still have to check other things, such as autocorrelation (to decide the thinning).
Usually the more variables you have (and more components also mean more variables), the longer it will take the Gibbs sampler to converge, but there is no mathematical formula for that.
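As a rough illustration (plain NumPy, not tied to any particular sampler; the toy trace is made up), you can inspect a trace's autocorrelation and the effective sample size it implies; a strongly autocorrelated chain contains far fewer effective samples than iterations:

```python
# Sketch of two post-hoc trace checks: autocorrelation and a crude effective
# sample size. Plain NumPy; the toy trace is a random walk, i.e. a badly
# mixing chain.
import numpy as np

def autocorr(chain, max_lag=100):
    """Sample autocorrelation of a 1-D trace up to max_lag."""
    c = chain - chain.mean()
    var = np.dot(c, c) / len(c)
    return np.array([np.dot(c[:len(c) - lag], c[lag:]) / (len(c) * var)
                     for lag in range(max_lag + 1)])

def effective_sample_size(chain, max_lag=100):
    """Crude ESS: sum autocorrelations until they first turn negative."""
    rho = autocorr(chain, max_lag)[1:]
    cut = np.argmax(rho < 0) if np.any(rho < 0) else len(rho)
    return len(chain) / (1.0 + 2.0 * rho[:cut].sum())

trace = np.cumsum(np.random.default_rng(2).normal(size=2000)) * 0.01
print("iterations:", len(trace), " effective samples:",
      round(effective_sample_size(trace), 1))
```

In practice you would usually rely on established diagnostics (Gelman-Rubin $\hat{R}$, Geweke, etc.) from a package such as ArviZ or coda rather than rolling your own.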
Best Answer
As I understand it, there are three questions in one; I will try to suggest some answers for all of them.
1. How to estimate the number of clusters?
In 1D I once tried to estimate the number of clusters by density estimation. First you estimate the density (for example, see `ksmooth` in R) with automatically chosen parameters. Then you can approximate the number of clusters by counting the local maxima of the density. That gives a good number to start with; after that you can do cross-validation (for the final method) to get a more exact number.
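A rough sketch of that idea (SciPy's `gaussian_kde` standing in for R's `ksmooth`; the data and grid are made up):

```python
# Sketch: estimate the density and count its local maxima as a first guess
# at the number of clusters. SciPy's gaussian_kde stands in for R's ksmooth;
# the data and grid are made up.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-3.0, 0.5, 200), rng.normal(2.0, 0.8, 300)])

kde = gaussian_kde(x)                      # bandwidth chosen automatically
grid = np.linspace(x.min(), x.max(), 1000)
dens = kde(grid)
# interior grid points larger than both neighbours are local maxima
modes = np.sum((dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:]))
print("estimated number of clusters:", int(modes))
```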
2. How to fit the models with equal variance?
That can be easily incorporated into the EM procedure. During the M-step you estimate the means $\mu_k$ as usual, but the common variance like this: $\sigma_k^2 = \sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu_{k(x_i)})^2$, where $k(x_i)$ returns the estimated class of the object $x_i$.
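If you would rather not modify the M-step by hand, a small sketch of an alternative (scikit-learn and the data are my own assumptions): `covariance_type="tied"` estimates one covariance shared by all components, which in 1-D is exactly this equal-variance constraint.

```python
# Sketch: covariance_type="tied" makes scikit-learn estimate a single
# covariance shared by all components, which in 1-D is the equal-variance
# constraint above. Data are made up.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.concatenate([rng.normal(0.0, 1.0, 300),
                    rng.normal(5.0, 1.0, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, covariance_type="tied").fit(X)
print("means:", gmm.means_.ravel())
print("shared variance:", gmm.covariances_.ravel())
```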
3. How to deal with uniform background noise?
If it's really uniformly distributed, it should not greatly affect the EM procedure. And after all, you can get rid of it by marking $x_i$ as belonging to a noise class if the probability of this object belonging to any class $k$ is less than some threshold. You could think of incorporating this step inside the EM algorithm, but I am not sure how well that would work.
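A hedged sketch of that thresholding idea (scikit-learn, made-up data, and a made-up 0.8 threshold): after fitting the mixture, flag a point as noise when no component claims it with high enough posterior probability.

```python
# Sketch of the thresholding idea: flag a point as background noise when no
# component claims it with high posterior probability. The 0.8 threshold and
# the data are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.concatenate([rng.normal(0.0, 0.3, 200),        # two tight clusters
                    rng.normal(4.0, 0.3, 200),
                    rng.uniform(-5.0, 10.0, 60)]).reshape(-1, 1)  # noise

gmm = GaussianMixture(n_components=2).fit(X)
resp = gmm.predict_proba(X)          # P(class k | x_i) for every point
is_noise = resp.max(axis=1) < 0.8    # no class is confident enough
print("points flagged as noise:", int(is_noise.sum()))
```

Note that a point far out in the tails can still get a confident posterior for its nearest component, so an alternative is to threshold the overall mixture density (e.g. via `score_samples`) instead.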
Hope some of the suggestions above would help!