Solved – Algorithms for 1D Gaussian mixture with equal variance and noise cluster

algorithms, dataset, gaussian mixture distribution, mixture-distribution

I would like to fit a Gaussian mixture model to some data. The data is 1D and I want to constrain all the Gaussians to have equal variance. I would also like to have a uniform background noise cluster to pick up points that are not in a real cluster.

I don't know how many clusters there should be, and this can vary over a large range for different data sets (tens to thousands are possible), so ideally this should be selected automatically. It might also be useful to put some lower threshold on the mixing coefficient for each cluster to avoid fitting very tiny ones to the background noise.

Depending on the data set there will be from hundreds to many thousands of data points.

What would be a good way to approach this problem? I am a bit overwhelmed by the possible options (EM, variational Bayes, infinite GMM, Dirichlet processes, etc.), and unsure whether what I want is even possible.

Best Answer

As I understand it, there are three questions in one; I will try to suggest answers for each of them.

1. How to estimate the number of clusters?

In 1D I once tried to estimate the number of clusters by density estimation. First you estimate the density (for example, with `density` in R) with automatically chosen parameters. Then you can approximate the number of clusters by counting the local maxima of the density. That is a good number to start with; after that you can do cross-validation (with the final method) to get a more exact number.
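A minimal sketch of this peak-counting idea in Python (the function name and grid size are my own choices; it assumes SciPy's `gaussian_kde` for the density estimate):

```python
import numpy as np
from scipy.stats import gaussian_kde

def estimate_n_clusters(x, grid_size=1000):
    """Rough cluster count: the number of local maxima of a KDE of the data."""
    kde = gaussian_kde(x)                     # bandwidth set automatically (Scott's rule)
    grid = np.linspace(x.min(), x.max(), grid_size)
    dens = kde(grid)
    # interior grid points that sit higher than both neighbours
    is_peak = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
    return int(is_peak.sum())
```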

2. How to fit the models with equal variance?

That can easily be incorporated into the EM procedure. During the M-step you estimate each mean $\mu_k$ as usual, but the variance is shared across components, so with hard assignments it is estimated as $\sigma_k^2 = \sigma^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu_{k(x_i)})^2$, where $k(x_i)$ returns the estimated class of the point $x_i$.
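As a sketch, here is the corresponding M-step with soft responsibilities rather than hard assignments (the function name and the `resp` layout are my assumptions): the shared variance is just the responsibility-weighted average of squared deviations, pooled over all components.

```python
import numpy as np

def m_step_shared_variance(x, resp):
    """M-step for a 1D mixture with one variance shared by all Gaussians.
    x: data, shape (N,); resp: E-step responsibilities, shape (N, K)."""
    Nk = resp.sum(axis=0)                          # effective count per component
    pi = Nk / len(x)                               # mixing weights
    mu = (resp * x[:, None]).sum(axis=0) / Nk      # per-component means
    # pooled variance: weighted squared deviations summed over all components
    var = (resp * (x[:, None] - mu) ** 2).sum() / len(x)
    return pi, mu, var
```

If you do not need the noise component, scikit-learn's `GaussianMixture` with `covariance_type='tied'` imposes the same equal-variance constraint out of the box.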

3. How to deal with uniform background noise?

If it's really uniformly distributed, it should not greatly affect the EM procedure. And after all, you can get rid of it by marking $x_i$ as belonging to the noise class if the probability of the point belonging to every class $k$ is less than some threshold. You can also think of incorporating this step inside the EM algorithm itself, for example by adding an explicit uniform component to the mixture; one possible sketch is below.
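Here is a hedged sketch of an E-step that treats noise as an explicit uniform component over the observed data range (the parameter names and the uniform-over-range parameterization are my assumptions, not from the question):

```python
import numpy as np
from scipy.stats import norm

def e_step_with_noise(x, pi, mu, var, pi_noise):
    """E-step for K equal-variance Gaussians plus a uniform noise component.
    x: (N,) data; pi, mu: (K,) weights and means; var: shared variance;
    pi_noise: mixing weight of the noise component (all weights sum to 1)."""
    width = x.max() - x.min()
    gauss = pi * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))  # (N, K)
    noise = np.full(len(x), pi_noise / width)                      # uniform density
    total = gauss.sum(axis=1) + noise
    return gauss / total[:, None], noise / total                   # responsibilities
```

Points whose noise responsibility exceeds your chosen threshold can then be flagged as background.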

I hope some of the suggestions above help!
