Your questions (reposted so I can be systematic and complete), with my answers inline:
Question1: I'm having trouble understanding what exactly a mixing probability is. If anyone could please direct me to a resource that I couldn't find, I'd really appreciate it.
Answer1:
Sweet and salty can go great together. So let's say you have a pretzel that you want to sprinkle a mix of salt and sugar over. Do you HAVE to do 50/50? No way. You can vary it from one molecule of salt and a spoonful of sugar all the way to one molecule of sugar and a spoonful of salt. The weights sum to 1.0 and are the proportions of the mix. If you have two components and the first weight is 0.00001 and the second is 0.99999, the first contributes almost nothing and the second dominates.
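To make that concrete, here is a minimal sketch in base R (the weights, means, and standard deviations are made up purely for illustration): a mixture density is just a weighted sum of component densities, and the mixing probabilities are those weights.
w <- c(0.3, 0.7)                             # mixing probabilities: non-negative, sum to 1
x <- seq(-4, 10, length.out = 400)
dens <- w[1] * dnorm(x, mean = 0, sd = 1) +  # component 1 supplies 30% of the mass
        w[2] * dnorm(x, mean = 5, sd = 1.5)  # component 2 supplies 70%
plot(x, dens, type = "l", ylab = "mixture density")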
For most ML things, I love autonlab. Here are their materials on mixture models. (link) This stuff tends to be pretty good.
Question 2: Since I only have the 3 data types listed above, how would I find standard deviation values without the range of raw data values?
Answer 2:
You can back it out of the fitted parameters. Here is the documentation for mclust. (link) Reading through it, while boring, can also be highly informative. The first thing to try is the summary command:
est <- Mclust(.... stuff ...)
summary(est)
If the parameter values show up in the displayed summary, then they are 1) stored somewhere, and 2) able to be extracted.
The fact that this selects the model with the best parameters (link) means that they can be accessed. I recommend you read through the documentation - it is in there. I also recommend you use RStudio, because it has a variable explorer that lets you look at the fields and sub-fields of est.
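As a sketch of what that extraction looks like (using the built-in faithful data so it runs as-is; substitute your own data):
library(mclust)
est <- Mclust(faithful$waiting)          # fit a univariate Gaussian mixture
summary(est, parameters = TRUE)          # prints means, variances, and mixing proportions
est$parameters$pro                       # mixing probabilities (the weights)
est$parameters$mean                      # component means
sqrt(est$parameters$variance$sigmasq)    # component standard deviations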
Question 3. And also any better description for a Gaussian finite mixture model than mine: forces data into multiple bell curves...?
See the link in Answer 1 for Autonlab. A finite mixture model is a weighted sum of probability densities. A Gaussian mixture, which is implicitly homogeneous, is a weighted sum of Gaussian distributions. There are non-Gaussian mixtures, and there are non-homogeneous mixtures. I particularly like the zero-inflated Poisson by Diane Lambert. I felt it was quite revolutionary in thought and valuable in application, allowing yield to be pushed closer to unity than homogeneous models, and much of the thinking at the time of publication, allowed.
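In symbols, a finite mixture with $K$ components has density
$$f(x) = \sum_{k=1}^{K} \pi_k \, f_k(x), \qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1,$$
where the $\pi_k$ are the mixing probabilities from Answer 1. A Gaussian mixture is the special case where every component $f_k$ is a normal density $\phi(x \mid \mu_k, \sigma_k^2)$.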
Question 4. Why are these mixture models called unsupervised learning? All the explanations I bumped into online were way too technical for me.
You pick the number of components, click, and it goes; the algorithm is never given labeled examples, so it has to infer the group memberships on its own. That is unsupervised. If you knew something about the classification of each observation, and that was part of the input, then it would qualify as supervised.
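mclust itself illustrates the distinction. Assuming the iris data purely for demonstration: Mclust never sees the labels, while its discriminant-analysis counterpart MclustDA takes them as input.
library(mclust)
fit_unsup <- Mclust(iris[, 1:4])                        # unsupervised: no labels used
fit_sup <- MclustDA(iris[, 1:4], class = iris$Species)  # supervised: labels are part of the input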
Question 5. As shown in the data above, I often have >2 peaks. However I'm only concerned with the first two. Can I ignore the rest and safely call that part a bimodal distribution?
Throw away results only when you know why you are doing it. Otherwise, whether or not you realize it, you are doing voodoo and not science: you don't know why it works, and if you did it twice you might get radically different results. Science is about knowing "why" before doing. It is about repeatable results.
In production, after we had root-caused one of the modes, only then could we focus on the other. If you spent time getting input data, and you knew what you were doing when you specified the mode count (a.k.a. the number of components), then throwing one out without a good reason is a bad idea.
That said, you could look at the weights and make an argument like: 95% of the weight is in the first two modes, so I am going to focus on them. Management who don't realize how stunningly bad an idea this can be will often fall prey to arguments like "80/20 rule, therefore I'm only retaining 80%." Remember that they are the technically illiterate (a.k.a. non-technical) ones, and you are serving them quite poorly, not doing due diligence in your area of expertise, if such things are allowed to occur.
Math, statistics, and data-analysis folks: we are the guardians of truth. Not religious or political truth, but actual "the universe is speaking" truth. If you get loose with what you know or what you don't, then you are doing a bad job, and you are defacing the hard work and excellence of your peers in the field. Don't do it. Being ruthlessly aggressive about what you do and don't know, never letting a hint of falsehood pass your lips, is how to build great credibility and a career worth having. Don't be an ass to folks, but be as excellent as you are capable of in your work.
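If you do go that route, at least quantify the claim. Assuming est is an Mclust fit as in Answer 2, a quick check might look like:
pro <- sort(est$parameters$pro, decreasing = TRUE)  # mixing weights, largest first
sum(pro[1:2])   # share of total weight in the two biggest modes; is it really ~0.95?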
Best of luck
Just a couple thoughts, having used mclust a bit previously.
1) mclust uses the correct BIC selection method; see this post:
Mclust model selection
See the very bottom, but to sum it up: whether you optimize for a low or a high BIC depends on whether the negative sign is included in the formula:
The general definition of the BIC is $BIC = -2 \times \ln(L(\theta \mid x)) + k \times \ln(n)$; mclust does not include the negative component.
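You can check the sign convention yourself; a minimal sketch on the built-in faithful data:
library(mclust)
bic <- mclustBIC(faithful)   # BIC for every covariance model and number of components
summary(bic)                 # the top-ranked models are the ones with the LARGEST BIC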
2) mclust uses mixture models to perform the clustering (i.e., it is model-based); this is quite different from k-means, so I would be careful with phrasing like it is a "tiny bit different than some of the other k-means cluster approaches" (mainly in what "other" implies here). The process for model selection is briefly described in the mclust manual:
mclust provides a Gaussian mixture fitted to the data by maximum likelihood through the EM algorithm, for the model and number of components selected according to BIC. The corresponding components are hierarchically combined according to an entropy criterion, following the methodology described in the article cited in the references section. The solutions with numbers of classes between the one selected by BIC and one are returned as a clustCombi class object.
It's more useful to see the actual paper for a thorough explanation:
https://www.stat.washington.edu/raftery/Research/PDF/Baudry2010.pdf
or here https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2953822/
The entropy plot provided by mclust is meant to be interpreted like a scree plot for a factor analysis (i.e., by looking for an elbow to determine the optimal number of classes). I would argue scree plots are useful for justifying the choice of the number of clusters, and these plots belong in the appendices.
mclust also returns the ICL statistic in addition to the BIC, so you could report that as a compromise to the reviewer:
https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html (see the example on how to get it to output the statistics)
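For instance, a short sketch on the built-in faithful data (mclustICL is, to my knowledge, the relevant function covered there):
library(mclust)
icl <- mclustICL(faithful)   # ICL for each model / number of components
summary(icl)                 # top models ranked by ICL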
3) if you wanted to create a table of the entPlot values, you can extract them like so (adapted from the ?entPlot example):
data(Baudry_etal_2010_JCGS_examples)
# run Mclust to get the MclustOutput
output <- clustCombi(ex4.2, modelNames = "VII")
entPlot(output$MclustOutput$z, output$combiM, reg = c(2,3))
# legend: in red, the single-change-point piecewise linear regression;
# in blue, the two-change-point piecewise linear regression.
# added code to extract entropy values from the plot
combiM <- output$combiM              # list of combining matrices, one per merge step
Kmax <- ncol(output$MclustOutput$z)  # number of components in the BIC-selected solution
z0 <- output$MclustOutput$z          # posterior membership probabilities
ent <- numeric(Kmax)
for (K in Kmax:1) {
  z0 <- t(combiM[[K]] %*% t(z0))     # merge components down to K classes
  ent[K] <- -sum(mclust:::xlog(z0))  # entropy of the K-class solution
}
data.frame(`Number of clusters` = 1:Kmax, `Entropy` = round(ent, 3))
  Number.of.clusters Entropy
1                  1   0.000
2                  2   0.000
3                  3   0.079
4                  4   0.890
5                  5   6.361
6                  6  20.158
7                  7  35.336
8                  8 158.008
Best Answer
Solution found:
So, to restate the question: why does the Mclust function default to the model with the highest BIC value as the "best" model? Great question! Let me give you a long-winded answer.
TL;DR: BIC values are an approximation to the integrated (not maximum) likelihood, and you want the model with the greatest integrated likelihood (Bayes factor), so you choose the model with the largest BIC.
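To sketch why (this is the standard Laplace-approximation argument, written with mclust's sign convention from the edit at the bottom): with $d$ free parameters and $n$ observations,
$$\ln p(x \mid M) \;\approx\; \ln L(\hat{\theta} \mid x) - \frac{d}{2} \ln(n) \;=\; \frac{1}{2}\, \mathrm{BIC}_{mclust},$$
so the model that maximizes mclust's BIC also maximizes the approximate integrated likelihood.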
Long answer: The purpose of using model-based clustering over heuristic clustering approaches such as k-means and hierarchical (agglomerative) clustering is to provide a more formal and intuitive way to compare and select an appropriate cluster model for your data.
Mclust uses clustering techniques based on probability models, namely Gaussian mixture models. Using probability models allows for the development of model-based approaches to compare different cluster models and sizes. See *Model-based Methods of Classification: Using the mclust Software in Chemometrics* (https://www.jstatsoft.org/article/view/v018i06) for more details.
As mentioned above, the authors state that the "best" model is the one with the largest BIC value. Here is another example from Enhanced Model-Based Clustering, Density Estimation, and Discriminant Analysis Software: MCLUST (https://www.stat.washington.edu/raftery/Research/PDF/fraley2003.pdf):
Model Selection: Now that there is a probability model attached to the clusters, you can use more sophisticated tools to compare multiple cluster models using Bayesian model selection via Bayes factors.
The approach is developed in detail in their paper How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis (http://www.stat.washington.edu/raftery/Research/PDF/fraley1998.pdf).
So, in summary, the BIC should not be minimized. The person using this model-based clustering approach should look for the model that maximizes the BIC, since the largest BIC corresponds to the greatest approximate integrated likelihood (Bayes factor).
That last statement also has a reference:
Banfield, J. D. and Raftery, A. E. (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803– 821.
EDIT: Based on an email exchange: the general definition of the BIC is $BIC = -2 \times \ln(L(\theta \mid x)) + k \times \ln(n)$; mclust does not include the negative component.