Model-Based Clustering – Mclust Model Selection

bic, clustering, gaussian-mixture-distribution, model-based-clustering, r

The R package mclust uses BIC as a criterion for cluster model selection. From my understanding, a model with the lowest BIC should be selected over other models (if you care solely about BIC). However, when BIC values are all negative, the Mclust function defaults to the model with the highest BIC value. My overall understanding from various trials is that mclust identifies the "best" model as the one with $\max\{\mathrm{BIC}_i\}$.
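For example, a quick trial on the built-in iris data (any numeric data would do) reproduces this:

```r
library(mclust)

# Fit Gaussian mixture models to the four numeric iris measurements
fit <- Mclust(iris[, 1:4])

summary(fit$BIC)
# All BIC values are negative, and the model reported as "best"
# is the one with the highest (least negative) BIC
```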

I am trying to understand why the authors made this decision. It is illustrated on the CRAN site: https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html

Also, the authors of the mclust package make a note of this in their paper *Model-based Methods of Classification: Using the mclust Software in Chemometrics*, on page 5.

The ‘best’ model is taken to be the one with the highest BIC among
the fitted models.

Can anyone shed light on this issue? If a lower BIC is always better, why do the authors not choose the model with the lowest BIC, but rather the model with the smallest absolute BIC? If possible, provide references.

Best Answer

Solution found:

So, to restate the question, why does the Mclust function default to the model with the highest BIC value as the "best" model?

Great question! Let me give you a long-winded answer to this.

TL;DR: BIC values approximate the integrated (not maximized) likelihood, and you want the model with the greatest integrated likelihood (Bayes factor), so you choose the model with the largest BIC.

Long answer: The purpose of using model-based clustering over heuristic clustering approaches such as k-means and hierarchical (agglomerative) clustering is to provide a more formal and intuitive approach to comparing and selecting an appropriate cluster model for your data.

Mclust uses clustering techniques based on probability models, namely Gaussian mixture models. Using probability models allows for the development of model-based approaches to comparing different cluster models and sizes. See *Model-based Methods of Classification: Using the mclust Software in Chemometrics* (https://www.jstatsoft.org/article/view/v018i06) for more details.
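To illustrate, here is a minimal sketch of how mclust fits and compares several mixture models at once (the built-in faithful data and the range G = 1:5 are arbitrary choices for demonstration):

```r
library(mclust)

# Compare mixtures with 1-5 components under every available covariance
# parameterization; mclustBIC() returns a matrix of BIC values
bic <- mclustBIC(faithful, G = 1:5)

bic        # rows = number of components, columns = model names
plot(bic)  # higher curves indicate stronger support
```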

As mentioned above, the authors state that the "best" model is the one with the largest BIC value. Here is another example from Enhanced Model-Based Clustering, Density Estimation, and Discriminant Analysis Software: MCLUST (https://www.stat.washington.edu/raftery/Research/PDF/fraley2003.pdf):

The Bayesian Information Criterion or BIC (Schwarz 1978) is the value of the maximized log-likelihood with a penalty on the number of parameters in the model, and allows comparison of models with differing parameterizations and/or differing numbers of clusters. In general, the larger the value of the BIC, the stronger the evidence for the model and number of clusters (see, e.g., Fraley and Raftery 2002a).

Model Selection: Now that there is a probability model attached to the clusters, you can use more sophisticated tools to compare multiple cluster models using Bayesian model selection via Bayes factors.

In their paper How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis (http://www.stat.washington.edu/raftery/Research/PDF/fraley1998.pdf), the authors write:

The Bayes factor is the posterior odds for one model against the other assuming neither is favoured a priori. Banfield and Raftery [2] used a heuristically derived approximation to twice the log Bayes factor, called the ‘AWE’, to determine the number of clusters in hierarchical clustering based on the classification likelihood. When EM is used to find the maximum mixture likelihood, a more reliable approximation to twice the log Bayes factor called the BIC (Schwarz [32]) is applicable:

$2 \log p(x \mid M) + \mathrm{const} \approx 2\, l_M(x, \hat{\theta}) - m_M \log(n) \equiv \mathrm{BIC}$

where $p(x \mid M)$ is the (integrated) likelihood of the data for the model $M$, $l_M(x, \hat{\theta})$ is the maximized mixture log-likelihood for the model, and $m_M$ is the number of independent parameters to be estimated in the model. The number of clusters is not considered an independent parameter for the purposes of computing the BIC. If each model is equally likely *a priori*, then $p(x \mid M)$ is proportional to the posterior probability that the data conform to the model $M$. Accordingly, the larger the value of the BIC, the stronger the evidence for the model.
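Since each model's BIC approximates $2 \log p(x \mid M)$ up to the same constant, the difference between two BIC values approximates twice the log Bayes factor between those models. A sketch (the component counts and the VVV parameterization are arbitrary choices for illustration):

```r
library(mclust)

bic <- mclustBIC(faithful)

# Difference of BIC values ~ 2 * log(Bayes factor) comparing the
# 2-component and 3-component unconstrained (VVV) models
twologBF <- bic["2", "VVV"] - bic["3", "VVV"]
twologBF  # positive values favour the 2-component model
```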

So, in summary, the BIC as computed by mclust should not be minimized. The person using this model-based clustering approach should look for the model that maximizes the BIC, since that is the model with the greatest approximate integrated likelihood (and hence the one favoured by the Bayes factor).

That last statement also has a reference:

Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. *Biometrics*, 49, 803–821.

EDIT: Based on an email exchange:

As a side note, always check how the BIC is defined. Sometimes, for example in most regression contexts (where traditionally a statistic is minimised for parameter estimation, e.g. residual sum of squares, deviance, etc.), the BIC is computed as -2*loglik + npar*log(n), i.e. the reverse of what is used in mclust. Clearly, in that case the BIC should be minimised.

The general definition of the BIC is $\mathrm{BIC} = -2 \ln L(\hat{\theta} \mid x) + k \ln(n)$; mclust reverses the sign, computing $2 \ln L(\hat{\theta} \mid x) - k \ln(n)$, which is why its BIC is maximised rather than minimised.
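A quick check of both conventions from a fitted model (a sketch, again using the built-in faithful data):

```r
library(mclust)

fit <- Mclust(faithful)

# mclust's convention: BIC = 2*loglik - npar*log(n), to be maximised
all.equal(2 * fit$loglik - fit$df * log(fit$n), fit$bic)  # TRUE

# Regression-style convention: -2*loglik + npar*log(n), to be minimised;
# it is the same quantity with the sign flipped
-1 * fit$bic
```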
