Solved – Determining characteristics of peaks after mclust finite mixture model

clustering, gaussian mixture distribution, normal distribution, r

I'm working with the mclust package in R (specifically using densityMclust). As output, I have a file with mixing probabilities, variances, and means for each normal distribution. The general format is:

foo prob1 var1 mean1 prob2 var2 mean2...
foo2 prob1 var1 mean1 prob2 var2 mean2...

for as many normal distributions as I can force.

This is all I have; I don't even have access to the actual raw data. My goal is to identify relative maxima (peaks). Would I go about doing this by finding the largest means?

I have a few more questions:

  1. I'm having trouble understanding what exactly a mixing probability is. If anyone could direct me to a good resource (I couldn't find one), I'd really appreciate it.

  2. Since I only have the 3 data types listed above, how would I find standard deviation values without the range of raw data values?

  3. And also any better description for a Gaussian finite mixture model than mine: forces data into multiple bell curves…?

  4. Why are these mixture models called unsupervised learning? All the explanations I bumped into online were way too technical for me.

  5. As shown in the data above, I often have >2 peaks. However, I'm only concerned with the first two. Can I ignore the rest and safely call that part a bimodal distribution?

Best Answer

Your questions (reposted so I can be systematic and complete) follow, with my answers inline:

Question 1: I'm having trouble understanding what exactly a mixing probability is. If anyone could direct me to a good resource (I couldn't find one), I'd really appreciate it.

Answer1:

Sweet and salty can go great together. So let's say you have a pretzel that you want to sprinkle a mix of salt and sugar over. Do you HAVE to do 50/50? No way. You can vary it from 1 molecule of salt and a spoonful of sugar all the way to 1 molecule of sugar and a spoonful of salt. The weights sum to 1.0 and give the proportions of the mix. If you have two components and the first weight is 0.00001 and the second is 0.99999, the first component contributes almost nothing to the mixture and the second dominates it.

For most ML things, I love Autonlab. Here are their materials on mixture models. (link) This stuff tends to be pretty good.
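If a toy example helps, here is a minimal sketch (the numbers are made up for illustration) of how the mixing probabilities weight two normal components into one density:

w  <- c(0.3, 0.7)   # mixing probabilities; must sum to 1
mu <- c(0, 5)       # component means
s  <- c(1, 2)       # component standard deviations

x   <- seq(-4, 12, length.out = 500)
mix <- w[1] * dnorm(x, mu[1], s[1]) + w[2] * dnorm(x, mu[2], s[2])
plot(x, mix, type = "l")   # two bumps; the one near 5 is larger because w[2] = 0.7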

Question 2: Since I only have the 3 data types listed above, how would I find standard deviation values without the range of raw data values?

Answer 2:

You can back it out of the fitted parameters. Here is the documentation for mclust. (link) Reading through it, while boring, can also be highly informative. The first thing to try is the "summary" command.

library(mclust)
est <- Mclust(.... stuff ...)     # or densityMclust(), which you are using
summary(est, parameters = TRUE)   # parameters = TRUE also prints the fitted parameters

If the parameter values show up in the displayed summary, then they are 1) stored somewhere, and 2) able to be extracted.

The fact that mclust selects the model with the best parameters (link) means that those parameters can be accessed. I recommend you read through the documentation - it is in there. I also recommend you use RStudio, because it has a variable explorer that lets you look at the fields and sub-fields of "est".
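If you want to go straight for the fields, a minimal sketch follows, assuming the usual layout of a univariate mclust fit (the slot names below, such as parameters$variance$sigmasq, are from memory - check str(est) if they differ). The key point for your question: standard deviation is just the square root of the stored variance.

library(mclust)

# Illustrative data; you would use your own vector here.
x   <- c(rnorm(200, mean = 0, sd = 1), rnorm(300, mean = 5, sd = 2))
est <- densityMclust(x)

est$parameters$pro                     # mixing probabilities (weights)
est$parameters$mean                    # component means
est$parameters$variance$sigmasq        # component variances
sqrt(est$parameters$variance$sigmasq)  # standard deviations = sqrt(variance)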

Question 3. And also any better description for a Gaussian finite mixture model than mine: forces data into multiple bell curves...?

See the link in answer 1 for Autonlab. A finite mixture model is a weighted sum of probability densities. A Gaussian mixture, which is implicitly homogeneous, is a weighted sum of Gaussian densities. There are non-Gaussian mixtures, and there are non-homogeneous mixtures. I particularly like the Zero-Inflated Poisson by Diane Lambert. I felt it was quite revolutionary in thought and valuable in application - it allowed yield to be pushed closer to unity than a homogeneous model, or much of the thinking at the time of publication, would allow.
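To make "weighted sum of densities" concrete, here is a small sketch that evaluates a univariate Gaussian mixture from parameters like the ones in your file and locates its relative maxima on a grid. The vectors w, mu, and sdev are hypothetical stand-ins for one row of your file; note that when components overlap, the peaks of the mixture sit near, but not necessarily exactly at, the component means.

# w, mu, sdev stand in for one row of your file (made-up numbers).
w    <- c(0.5, 0.3, 0.2)
mu   <- c(0, 4, 9)
sdev <- c(1, 1.5, 2)

# Mixture density: the weighted sum of the component normal densities.
mixture_density <- function(x, w, mu, sdev) {
  rowSums(sapply(seq_along(w), function(k) w[k] * dnorm(x, mu[k], sdev[k])))
}

grid <- seq(min(mu) - 4 * max(sdev), max(mu) + 4 * max(sdev), length.out = 2000)
dens <- mixture_density(grid, w, mu, sdev)

# A grid point is a relative maximum (peak) if it is higher than both neighbours.
peaks <- which(diff(sign(diff(dens))) == -2) + 1
grid[peaks]   # approximate peak locations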

Question 4. Why are these mixture models called unsupervised learning? All the explanations I bumped into online were way too technical for me.

You pick "number of components" and click and it goes. That is unsupervised. If you knew something about classification, and that was part of the input, then it would qualify as supervised.

Question 5. As shown in the data above, I often have >2 peaks. However, I'm only concerned with the first two. Can I ignore the rest and safely call that part a bimodal distribution?

Throw away results only when you know why you are doing it. Otherwise, whether or not you realize it, you are doing voodoo and not science: you don't know why it works, and if you did it twice you might get radically different results. Science is about knowing "why" before doing. It is about repeatable results.

In production, after we had root-caused one of the modes, only then could we focus on the other. If you spent time gathering the input data, and you knew what you were doing when you specified the mode count (aka number of components), then throwing a mode out without a good reason is a bad idea.

That said, you could look at the weights and make an argument like "95% of the weight is in the first 2 modes, so I am going to focus on them." Management who don't realize how stunningly bad an idea it is will often fall prey to arguments like "80/20 rule - therefore I'm only retaining 80%." Remember that they are technically illiterate (aka non-technical), and you are serving them quite poorly - not doing due diligence in your area of expertise - if such things are allowed to occur. Math, statistics, and data-analysis folks - we are the guardians of truth. Not religious or political truth but actual "the universe is speaking" truth. If you get loose with what you know or what you don't, then you are doing a bad job, and you are devaluing the hard work and excellence of your peers in the field. Don't do it. Being ruthlessly clear about what you do and don't know - never letting a hint of falsehood pass your lips - is how to build great credibility and a career worth having. Don't be an ass to folks - but be as excellent as you are capable of in your work.
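As a tiny sketch of that kind of check (the weights below are made up; in practice they would come from one row of your file):

w <- c(0.48, 0.47, 0.03, 0.02)        # example mixing probabilities for one row
sum(sort(w, decreasing = TRUE)[1:2])  # 0.95 - "95% of the weight is in two modes"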

Best of luck
