R clustering using mclust: BIC values are often NA


I'm working on segmentation/clustering and trying to use Gaussian mixture modelling for model-based clustering. I'm using the R package mclust to find the best fit for my data.

All data is transformed to a uniform distribution with mean zero and standard deviation one (I know, not Gaussian), and the variables included were chosen based on earlier attempts using k-means, where they seemed to discriminate well. Of course, k-means comes with some drawbacks (lack of statistical foundation, no control of cross-correlation, etc.), which is why I want to use model-based clustering (or latent class analysis, with the package poLCA).

When using mclustBIC, many of the possible BIC values are NA. I tried reducing the dimension of the data, but this didn't improve the output. For example, the VEV model is only calculated for 1 to 3 clusters, while it looks like it could improve with more clusters (see plot below).

Has anyone experienced similar problems? Can someone point me in the right direction for finding the best model using mclust? I would like to compute the BIC for models with a higher number of clusters.

Help would be appreciated!

[Image: plot of the BIC values returned by mclustBIC]

Best Answer

I have noted this before in another question here.

My guess is that in some cases the model is over-parametrized, i.e. it has too many parameters for the available data, and so cannot be fitted. It is also possible that the covariance matrices need regularization (adding a small constant to them) to make them invertible.
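To make the invertibility point concrete, here is a minimal base-R sketch. It is not what mclust does internally. The sizes `n` and `d`, the constant `eps`, and the choice to add the constant to the diagonal are all assumptions made for illustration. It shows that a covariance matrix estimated from fewer observations than variables is singular, and that a small diagonal constant restores invertibility.

```r
# Illustrative only: a tiny "cluster" with fewer observations than variables.
set.seed(42)
d <- 5                                # number of variables (assumed)
n <- 4                                # number of observations (assumed)
x <- matrix(rnorm(n * d), nrow = n)

S <- cov(x)                           # d x d sample covariance, rank at most n - 1
rank_S <- qr(S)$rank                  # smaller than d, so S is singular

eps <- 1e-3                           # illustrative constant, not an mclust default
S_reg <- S + eps * diag(d)            # add a small constant to the diagonal
rank_reg <- qr(S_reg)$rank            # full rank after regularization
S_inv <- solve(S_reg)                 # inversion now succeeds

cat("rank before:", rank_S, " rank after:", rank_reg, "\n")
```

The same thing can happen in a mixture model when a component ends up with very few points relative to the number of variables, which is more likely as the number of clusters grows.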

Please let us know if you find this or another cause, and/or any workaround.

BTW, scaling data to have mean zero and standard deviation one does not make it uniform. Scaling is a linear transformation, so it preserves the shape of the distribution; in particular, normally distributed data stays normal. Here is an example:

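set.seed(1)                           # make the example reproducible
x <- rnorm(1000, mean = 10, sd = 3)   # normal data with mean 10 and sd 3
par(mfrow = c(1, 2))                  # two plots side by side
hist(x)                               # original data
hist(scale(x))                        # standardized data: same shape, mean 0, sd 1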

[Image: side-by-side histograms of x and scale(x)]

EDIT

Looking at the mclust documentation (page 13), enabling the conjugate prior with the argument "prior=priorControl()" should produce fewer missing BIC values.
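A minimal sketch of how this could be tried is shown here. The question's data is not available, so the built-in iris measurements are used as a stand-in, and the range `G = 1:9` is an arbitrary choice. Whether the prior actually reduces the number of NA entries depends on the data.

```r
library(mclust)

# Stand-in data (assumption): the four numeric iris columns.
X <- iris[, 1:4]

# BIC table without a prior, and with the default conjugate prior.
bic_default <- mclustBIC(X, G = 1:9)
bic_prior   <- mclustBIC(X, G = 1:9, prior = priorControl())

# Count the (number of clusters, model) combinations with no BIC value.
n_missing_default <- sum(is.na(bic_default))
n_missing_prior   <- sum(is.na(bic_prior))
cat("NA entries without prior:", n_missing_default, "\n")
cat("NA entries with prior:   ", n_missing_prior, "\n")

# Show the best models according to the BIC table computed with the prior.
print(summary(bic_prior))
```

The same `prior` argument can also be passed to `Mclust()` directly when fitting the final model.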