How to fit a mixture model for clustering

Tags: clustering, gaussian mixture distribution, r

I have two variables, X and Y, and I need to group them into at most 5 clusters (5 is also the ideal number). Suppose the ideal plot of the variables looks like the following:

enter image description here

I would like to make 5 clusters of this. Something like this:

enter image description here

So I think this is a mixture model with 5 clusters. Each cluster has a center point and a confidence circle around it.

The clusters are not always as clean as this. Sometimes they look like the following, where two clusters are close together or one or two clusters are missing entirely.

enter image description here

enter image description here

How can I fit a mixture model and perform classification (clustering) effectively in this situation?

Example:

set.seed(1234)
X <- c(rnorm(200, 10, 3), rnorm(200, 25,3),
        rnorm(200,35,3), rnorm(200,65, 3), rnorm(200,80,5))
Y <- c(rnorm(1000, 30, 2))
plot(X,Y, ylim = c(10, 60), pch = 19, col = "gray40")

Best Answer

Here is a script that fits a mixture model using the mclust package.

set.seed(1234)  # same seed as in the question, so the simulated data are reproducible
X <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
       rnorm(200, 65, 3), rnorm(200, 80, 5))
Y <- rnorm(1000, 30, 2)
plot(X, Y, ylim = c(10, 60), pch = 19, col = "gray40")

library(mclust)
# With G unspecified, Mclust tries 1 to 9 components and keeps the model with the best BIC
xyMclust <- Mclust(data.frame(X, Y))
plot(xyMclust)

enter image description here enter image description here
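Since the question also asks about classification, here is a minimal sketch of how the fitted object can be inspected. It regenerates the same simulated data; the names `xy`, `fit` and `newxy` are made up for illustration, and the two new points are arbitrary.

```r
library(mclust)

set.seed(1234)
X <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
       rnorm(200, 65, 3), rnorm(200, 80, 5))
Y <- rnorm(1000, 30, 2)
xy <- data.frame(X, Y)

fit <- Mclust(xy)

summary(fit)                # chosen covariance model, number of components, cluster sizes
fit$G                       # number of components selected by BIC
t(fit$parameters$mean)      # one row per estimated cluster center (X, Y)
head(fit$classification)    # hard cluster label for each observation
head(fit$uncertainty)       # 1 minus the largest posterior probability, per observation

# Assign new (illustrative) points to the fitted clusters
newxy <- data.frame(X = c(12, 66), Y = c(30, 31))
pred <- predict(fit, newdata = newxy)
pred$classification
```

Observations with high `uncertainty` are the ones sitting between neighbouring clusters, which is where the overlapping groups around X = 25 and X = 35 would be expected to show up.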

In a situation where there are fewer than 5 clusters:

X1 <- c(rnorm(200, 10, 3), rnorm(200, 25,3), rnorm(200,35,3),  rnorm(200,80,5))
Y1 <- c(rnorm(800, 30, 2))
xyMclust <- Mclust(data.frame (X1,Y1))
plot(xyMclust)

enter image description here

xyMclust3 <- Mclust(data.frame(X1, Y1), G = 3)
plot(xyMclust3)

enter image description here

In this case we are forcing a fit with 3 clusters. What if we fit 5 clusters?

xyMclust5 <- Mclust(data.frame(X1, Y1), G = 5)
plot(xyMclust5)

Setting G = 5 forces mclust to produce 5 clusters, even though the data were generated from 4 groups.

enter image description here
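Rather than forcing a fixed number, the "at most 5" requirement from the question can be expressed by passing a range to `G`. The following is a sketch under that assumption; `xy1` and `fit5` are illustrative names, and the number of clusters actually selected depends on the data.

```r
library(mclust)

set.seed(1234)
X1 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3), rnorm(200, 80, 5))
Y1 <- rnorm(800, 30, 2)
xy1 <- data.frame(X1, Y1)

# Consider only 1 to 5 components; BIC picks the count within that range
fit5 <- Mclust(xy1, G = 1:5)

fit5$G                       # selected number of clusters (never more than 5)
summary(fit5$BIC)            # best models by BIC
plot(fit5, what = "BIC")     # BIC for each covariance model and each G
table(fit5$classification)   # cluster sizes
```

This way a missing cluster does not have to be filled in artificially: if the data support only 3 or 4 groups, BIC is free to choose that.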

Also let's introduce some random noise:

X2 <- c(rnorm(200, 10, 3), rnorm(200, 25,3), rnorm(200,35,3),  rnorm(200,80,5), runif(50,1,100 ))
Y2 <- c(rnorm(850, 30, 2))
xyMclust1 <- Mclust(data.frame (X2,Y2))
plot(xyMclust1)

mclust supports model-based clustering with noise, that is, outlying observations that do not belong to any cluster. It also lets you specify a prior distribution to regularize the fit to the data. The function priorControl is provided for specifying the prior and its parameters. When called with its defaults, it invokes another function, defaultPrior, which can serve as a template for specifying alternative priors. To include noise in the model, an initial guess of the noise observations must be supplied via the noise component of the initialization argument in Mclust or mclustBIC.

enter image description here
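A rough sketch of the noise and prior options described above. The initial noise guess here is an assumption of mine, not something mclust prescribes: points whose 10th nearest neighbour is unusually far away are flagged as candidate noise. The names `xy2`, `kth`, `noiseInit`, `fitNoise` and `fitPrior` are illustrative, as is the 90% cutoff.

```r
library(mclust)

set.seed(1234)
X2 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
        rnorm(200, 80, 5), runif(50, 1, 100))
Y2 <- rnorm(850, 30, 2)
xy2 <- data.frame(X2, Y2)

# Illustrative heuristic for the initial noise guess:
# distance to the 10th nearest neighbour on standardized data
d <- as.matrix(dist(scale(xy2)))
kth <- apply(d, 1, function(r) sort(r)[11])   # position 1 is the point itself
noiseInit <- kth > quantile(kth, 0.90)        # flag the most isolated 10% of points

# Fit with a noise component, allowing at most 5 clusters
fitNoise <- Mclust(xy2, G = 1:5, initialization = list(noise = noiseInit))
table(fitNoise$classification)                # label 0 marks the noise component

# Fit with the default prior to regularize the estimates
fitPrior <- Mclust(xy2, G = 1:5, prior = priorControl())
fitPrior$G
```

The initial guess only seeds the estimation; the final noise assignments come from the fitted model, so a crude heuristic is often enough to get started.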

Another option is the mixtools package, which lets you supply starting values for the mean and sigma of each component.

library(mixtools)
X2 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
        rnorm(200, 80, 5), rpois(50, 30))
Y2 <- c(rnorm(800, 30, 2), rpois(50, 30))
xy2 <- cbind(X2, Y2)  # mvnormalmixEM expects a numeric matrix
# lambda, mu and sigma are left as NULL here, so starting values are generated automatically
out <- mvnormalmixEM(xy2, lambda = NULL, mu = NULL, sigma = NULL, k = 5,
                     arbmean = TRUE, arbvar = TRUE, epsilon = 1e-08,
                     maxit = 10000, verb = FALSE)
plot(out, density = TRUE, alpha = c(0.01, 0.05, 0.10, 0.12, 0.15), marginal = TRUE)

enter image description here
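To show what supplying starting values to mixtools might look like, here is a sketch on noise-free data with 4 groups. The starting centers and covariances in `mu0` and `sigma0` are assumed values, as if read off a scatter plot, and the looser `epsilon` is chosen only to keep the run short.

```r
library(mixtools)

set.seed(1234)
X2 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3), rnorm(200, 80, 5))
Y2 <- rnorm(800, 30, 2)
xy <- cbind(X2, Y2)

# Assumed starting values: one mean vector and one covariance matrix per component
mu0 <- list(c(10, 30), c(25, 30), c(35, 30), c(80, 30))
sigma0 <- replicate(4, diag(c(9, 4)), simplify = FALSE)

out <- mvnormalmixEM(xy, mu = mu0, sigma = sigma0, k = 4, epsilon = 1e-4)

out$mu                                      # estimated component means
cl <- apply(out$posterior, 1, which.max)    # hard labels from posterior probabilities
table(cl)
```

Unlike mclust, mixtools does not choose the number of components for you, so `k` has to be fixed in advance or compared across several fits.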
