Solved – Singularity issues in Gaussian mixture model

gaussian mixture distribution

In Chapter 9 of the book Pattern Recognition and Machine Learning, there is this part about the Gaussian mixture model:

[image: excerpt from PRML, Chapter 9, on how maximizing the Gaussian mixture likelihood can lead to singularities]
To be honest, I don't really understand why this would create a singularity. Can anyone explain it to me? I'm just an undergraduate and a novice in machine learning, so my question may sound a little silly, but please help me. Thank you very much.

Best Answer

If we fit a Gaussian to a single data point by maximum likelihood, we get a degenerate, infinitely "spiky" Gaussian that collapses onto that point: the maximum-likelihood variance goes to zero. In the multivariate case the covariance matrix becomes singular (non-invertible), which is why this is called the singularity problem.
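To see the collapse concretely in one dimension: for a single point $x_1$, the maximum-likelihood mean is $\mu = x_1$, and the likelihood as a function of $\sigma$ is then

$$\mathcal{N}(x_1 \mid \mu = x_1, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp(0) = \frac{1}{\sqrt{2\pi}\,\sigma} \;\longrightarrow\; \infty \quad \text{as } \sigma \to 0,$$

so no finite-width Gaussian maximizes the likelihood; this is essentially the calculation behind Eq. (9.15) in the book.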

When the variance goes to zero, the likelihood of that Gaussian component (Eq. (9.15) in the book) goes to infinity, so maximizing the likelihood severely overfits. This cannot happen when we fit a single Gaussian to several distinct points, because the maximum-likelihood variance is then strictly positive. But it can happen with a mixture of Gaussians: one component can centre itself on a single data point and shrink its variance towards zero, while the other components account for the remaining points, as illustrated on the same page of PRML.
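Here is a quick numerical illustration of the difference (the toy data, the 50/50 mixing weights, and the choice of pinning the first component on the first data point are my own, not from the book):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=20)   # toy 1-D data set

# A single Gaussian fitted by maximum likelihood: the log-likelihood is finite.
mu_ml, sigma_ml = x.mean(), x.std()
print("single Gaussian ML log-likelihood:", norm.logpdf(x, mu_ml, sigma_ml).sum())

# A two-component mixture: pin component 0 on the first data point and shrink
# its standard deviation. Component 1 stays broad, so the other points keep a
# non-zero density and the total log-likelihood grows without bound.
for sigma0 in [1.0, 1e-2, 1e-4, 1e-8]:
    dens = (0.5 * norm.pdf(x, loc=x[0], scale=sigma0)
            + 0.5 * norm.pdf(x, loc=mu_ml, scale=sigma_ml))
    print(f"sigma0 = {sigma0:g}: mixture log-likelihood = {np.log(dens).sum():.2f}")
```

The pinned point contributes roughly $-\log\sigma_0$ to the mixture log-likelihood, so the likelihood can be made arbitrarily large; with a single Gaussian, shrinking $\sigma$ would instead drive the likelihood of all the other points to zero.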

[image: PRML figure illustrating one mixture component collapsing onto a single data point]

Update:
The book suggests two ways of dealing with the singularity problem (a rough code sketch of both follows the list):

1) resetting the mean and variance of the offending component whenever a collapse is detected, then continuing the optimisation;

2) using MAP estimation instead of MLE by placing a prior on the parameters.
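Here is a rough sketch of how both ideas could be wired into a simple 1-D EM loop (the function name, the variance floor, and the simple shrinkage-style prior are my own illustration, not the book's exact recipe):

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, var_floor=1e-6, prior_strength=0.0, seed=0):
    """Toy 1-D EM for a Gaussian mixture with two safeguards (a sketch):
      * var_floor      -- if a component's variance collapses below this,
                          its mean and variance are reset (remedy 1);
      * prior_strength -- shrinks each variance towards the overall data
                          variance, a crude MAP-style update (remedy 2).
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] of each component for each point.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        num = pi * dens
        gamma = num / num.sum(axis=1, keepdims=True)

        # M-step: the standard mixture updates (PRML Sec. 9.2.2, 1-D case).
        nk = gamma.sum(axis=0) + 1e-12          # guard against empty components
        pi = nk / n
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk

        # Remedy 2 (MAP-flavoured): pull the variances towards a prior value.
        if prior_strength > 0:
            var = (nk * var + prior_strength * x.var()) / (nk + prior_strength)

        # Remedy 1: reset any component that has collapsed onto a point.
        collapsed = var < var_floor
        if collapsed.any():
            mu[collapsed] = rng.choice(x, size=collapsed.sum())
            var[collapsed] = x.var()

    return pi, mu, var
```

With `prior_strength = 0` and a small `var_floor` you get the detect-and-reset heuristic; with `prior_strength > 0` the variance update is pulled towards the overall data variance, which is roughly what a MAP update with a suitable prior does, so the variance can never reach zero and the likelihood stays bounded.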