Your questions (reposted so I can be systematic and complete), with my answers inline:
Question1: I'm having trouble understanding what exactly a mixing probability is. If anyone could please direct me to a resource that I couldn't find, I'd really appreciate it.
Answer1:
Sweet and salty can go great together. So let's say you have a pretzel that you want to sprinkle a mix of salt and sugar over. Do you HAVE to do 50/50? No way. You can vary it from one molecule of salt and a spoonful of sugar all the way to one molecule of sugar and a spoonful of salt. The weights sum to 1.0 and are the proportions of the mix. If you have two components and the first weight is 0.00001 and the second is 0.99999, the first contributes almost nothing and the second dominates.
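To make that concrete, here is a minimal sketch in base R (the weights, means, and standard deviations are made up purely for illustration): a mixture density is just a weighted sum of component densities, and the mixing probabilities are those weights.
w <- c(0.3, 0.7)                             # mixing probabilities: non-negative, sum to 1
x <- seq(-4, 10, length.out = 400)
dens <- w[1] * dnorm(x, mean = 0, sd = 1) +  # component 1 supplies 30% of the mass
        w[2] * dnorm(x, mean = 5, sd = 1.5)  # component 2 supplies 70%
plot(x, dens, type = "l", ylab = "mixture density")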
For most ML things, I love autonlab. Here are their materials on mixture models. (link) This stuff tends to be pretty good.
Question 2: Since I only have the 3 data types listed above, how would I find standard deviation values without the range of raw data values?
Answer 2:
You can back it out of the fitted parameters. Here is the documentation for mclust. (link) Reading through it, while boring, can also be highly informative. The first thing to try is the summary command:
est <- Mclust(.... stuff ...)
summary(est)
If the parameter values show up in the displayed summary, then they are 1) stored somewhere, and 2) able to be extracted.
The fact that this selects the model with the best parameters (link) means that they can be accessed. I recommend you read through the documentation - it is in there. I also recommend you use RStudio, because it has a variable explorer that lets you look at the fields and sub-fields of est.
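As a sketch of what that extraction looks like (using the built-in faithful data so it runs as-is; substitute your own data):
library(mclust)
est <- Mclust(faithful$waiting)          # fit a univariate Gaussian mixture
summary(est, parameters = TRUE)          # prints means, variances, and mixing proportions
est$parameters$pro                       # mixing probabilities (the weights)
est$parameters$mean                      # component means
sqrt(est$parameters$variance$sigmasq)    # component standard deviations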
Question 3. And also any better description for a Gaussian finite mixture model than mine: forces data into multiple bell curves...?
See the link in Answer 1 for Autonlab. A finite mixture model is a weighted sum of probability densities. A Gaussian mixture, which is implicitly homogeneous, is a weighted sum of Gaussian distributions. There are non-Gaussian mixtures, and there are non-homogeneous mixtures. I particularly like the zero-inflated Poisson by Diane Lambert. I felt it was quite revolutionary in thought and valuable in application, allowing yield to be pushed closer to unity than homogeneous models, and much of the thinking at the time of publication, allowed.
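In symbols, a finite mixture with $K$ components has density
$$f(x) = \sum_{k=1}^{K} \pi_k \, f_k(x), \qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1,$$
where the $\pi_k$ are the mixing probabilities from Answer 1. A Gaussian mixture is the special case where every component $f_k$ is a normal density $\phi(x \mid \mu_k, \sigma_k^2)$.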
Question 4. Why are these mixture models called unsupervised learning? All the explanations I bumped into online were way too technical for me.
You pick the number of components, click, and it goes; the algorithm is never given labeled examples, so it has to infer the group memberships on its own. That is unsupervised. If you knew something about the classification of each observation, and that was part of the input, then it would qualify as supervised.
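mclust itself illustrates the distinction. Assuming the iris data purely for demonstration: Mclust never sees the labels, while its discriminant-analysis counterpart MclustDA takes them as input.
library(mclust)
fit_unsup <- Mclust(iris[, 1:4])                        # unsupervised: no labels used
fit_sup <- MclustDA(iris[, 1:4], class = iris$Species)  # supervised: labels are part of the input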
Question 5. As shown in the data above, I often have >2 peaks. However I'm only concerned with the first two. Can I ignore the rest and safely call that part a bimodal distribution?
Throw away results only when you know why you are doing it. Otherwise, whether or not you realize it, you are doing voodoo and not science: you don't know why it works, and if you did it twice you might get radically different results. Science is about knowing "why" before doing. It is about repeatable results.
In production, after we had root-caused one of the modes, only then could we focus on the other. If you spent time getting input data, and you knew what you were doing when you specified the mode count (a.k.a. the number of components), then throwing one out without a good reason is a bad idea.
That said, you could look at the weights and make an argument like: 95% of the weight is in the first two modes, so I am going to focus on them. Management who don't realize how stunningly bad an idea this can be will often fall prey to arguments like "80/20 rule, therefore I'm only retaining 80%." Remember that they are the technically illiterate (a.k.a. non-technical) ones, and you are serving them quite poorly, not doing due diligence in your area of expertise, if such things are allowed to occur.
Math, statistics, and data-analysis folks: we are the guardians of truth. Not religious or political truth, but actual "the universe is speaking" truth. If you get loose with what you know or what you don't, then you are doing a bad job, and you are defacing the hard work and excellence of your peers in the field. Don't do it. Being ruthlessly aggressive about what you do and don't know, never letting a hint of falsehood pass your lips, is how to build great credibility and a career worth having. Don't be an ass to folks, but be as excellent as you are capable of in your work.
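If you do go that route, at least quantify the claim. Assuming est is an Mclust fit as in Answer 2, a quick check might look like:
pro <- sort(est$parameters$pro, decreasing = TRUE)  # mixing weights, largest first
sum(pro[1:2])   # share of total weight in the two biggest modes; is it really ~0.95?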
Best of luck
Just a couple thoughts, having used mclust a bit previously.
1) mclust uses the correct BIC selection method; see this post:
Mclust model selection
See the very bottom, but to sum it up: whether you optimize for a low or a high BIC depends on whether the negative sign is included in the formula:
The general definition of the BIC is $BIC = -2 \times \ln(L(\theta \mid x)) + k \times \ln(n)$; mclust does not include the negative component.
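You can check the sign convention yourself; a minimal sketch on the built-in faithful data:
library(mclust)
bic <- mclustBIC(faithful)   # BIC for every covariance model and number of components
summary(bic)                 # the top-ranked models are the ones with the LARGEST BIC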
2) mclust uses mixture models to perform the clustering (i.e., it is model-based); this is quite different from k-means, so I would be careful with phrasing like it is a "tiny bit different than some of the other k-means cluster approaches" (mainly in what "other" implies here). The process for model selection is briefly described in the mclust manual:
mclust provides a Gaussian mixture fitted to the data by maximum likelihood through the EM algorithm, for the model and number of components selected according to BIC. The corresponding components are hierarchically combined according to an entropy criterion, following the methodology described in the article cited in the references section. The solutions with numbers of classes between the one selected by BIC and one are returned as a clustCombi class object.
It's more useful to see the actual paper for a thorough explanation:
https://www.stat.washington.edu/raftery/Research/PDF/Baudry2010.pdf
or here https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2953822/
The entropy plot provided by mclust is meant to be interpreted like a scree plot for a factor analysis (i.e., by looking for an elbow to determine the optimal number of classes). I would argue scree plots are useful for justifying the choice of the number of clusters, and these plots belong in the appendices.
mclust also returns the ICL statistic in addition to the BIC, so you could report that as a compromise to the reviewer:
https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html (see the example on how to get it to output the statistics)
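For instance, a short sketch on the built-in faithful data (mclustICL is, to my knowledge, the relevant function covered there):
library(mclust)
icl <- mclustICL(faithful)   # ICL for each model / number of components
summary(icl)                 # top models ranked by ICL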
3) if you wanted to create a table of the entPlot values, you can extract them like so (adapted from the ?entPlot example):
data(Baudry_etal_2010_JCGS_examples)
# run Mclust to get the MclustOutput
output <- clustCombi(ex4.2, modelNames = "VII")
entPlot(output$MclustOutput$z, output$combiM, reg = c(2,3))
# legend: in red, the single-change-point piecewise linear regression;
# in blue, the two-change-point piecewise linear regression.
# added code to extract entropy values from the plot
combiM <- output$combiM              # list of combining matrices, one per merge step
Kmax <- ncol(output$MclustOutput$z)  # number of components in the BIC-selected solution
z0 <- output$MclustOutput$z          # posterior membership probabilities
ent <- numeric(Kmax)
for (K in Kmax:1) {
  z0 <- t(combiM[[K]] %*% t(z0))     # merge components down to K classes
  ent[K] <- -sum(mclust:::xlog(z0))  # entropy of the K-class solution
}
data.frame(`Number of clusters` = 1:Kmax, `Entropy` = round(ent, 3))
  Number.of.clusters Entropy
1                  1   0.000
2                  2   0.000
3                  3   0.079
4                  4   0.890
5                  5   6.361
6                  6  20.158
7                  7  35.336
8                  8 158.008
Best Answer
Solution found:
So, to restate the question: why does the Mclust function default to the model with the highest BIC value as the "best" model? Great question! Let me give you a long-winded answer.
TL;DR: BIC values are an approximation to the integrated (not maximum) likelihood, and you want the model with the greatest integrated likelihood (Bayes factor), so you choose the model with the largest BIC.
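To sketch why (this is the standard Laplace-approximation argument, written with mclust's sign convention from the edit at the bottom): with $d$ free parameters and $n$ observations,
$$\ln p(x \mid M) \;\approx\; \ln L(\hat{\theta} \mid x) - \frac{d}{2} \ln(n) \;=\; \frac{1}{2}\, \mathrm{BIC}_{mclust},$$
so the model that maximizes mclust's BIC also maximizes the approximate integrated likelihood.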
Long answer: The purpose of using model-based clustering over heuristic clustering approaches such as k-means and hierarchical (agglomerative) clustering is to provide a more formal and intuitive way to compare and select an appropriate cluster model for your data.
Mclust uses clustering techniques based on probability models, namely Gaussian mixture models. Using probability models allows for the development of model-based approaches to compare different cluster models and sizes. See *Model-based Methods of Classification: Using the mclust Software in Chemometrics* (https://www.jstatsoft.org/article/view/v018i06) for more details.
As mentioned above, the authors state that the "best" model is the one with the largest BIC value. Here is another example from Enhanced Model-Based Clustering, Density Estimation, and Discriminant Analysis Software: MCLUST (https://www.stat.washington.edu/raftery/Research/PDF/fraley2003.pdf):
Model Selection: Now that there is a probability model attached to the clusters, you can use more sophisticated tools to compare multiple cluster models using Bayesian model selection via Bayes factors.
The approach is developed in detail in their paper How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis (http://www.stat.washington.edu/raftery/Research/PDF/fraley1998.pdf).
So, in summary, the BIC should not be minimized. The person using this model-based clustering approach should look for the model that maximizes the BIC, since the largest BIC corresponds to the greatest approximate integrated likelihood (Bayes factor).
That last statement also has a reference:
Banfield, J. D. and Raftery, A. E. (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803– 821.
EDIT: Based on an email exchange: the general definition of the BIC is $BIC = -2 \times \ln(L(\theta \mid x)) + k \times \ln(n)$; mclust does not include the negative component.