Solved – Best BIC value for K-means clusters

bic, clustering, k-means

I am using the code from "Using BIC to estimate the number of k in KMEANS" (answer by Prabhath Nanisetty) to compute BIC values for K-means with different numbers of clusters. However, with the iris dataset, I get the following results:

N_clusters    BIC
1             -863.896405
2             -674.133038
3             -616.557809
4             -603.357368
5             -582.428798
6             -596.073710
7             -590.086212
8             -579.876476
9             -554.665433

This is shown in the following plot:

[Plot: BIC versus N_clusters for the iris data]

The plot after standardization of data:

[Plot: BIC versus N_clusters after standardization]

Is it normal to have negative values for BIC? Which is the best number of clusters by BIC here, especially considering that the iris data set has 3 groups? The most negative value in the above list is for 1 cluster.

Best Answer

I also use the code from the link you provided.

First, it is normal to have negative BIC values. Since you are using BIC = log-likelihood - penalty, you want the highest value, so from your first image we would pick N_clusters = 8 and from the second image N_clusters = 9.

I get almost the same result if I use the squared Euclidean distance:

[Plot: BIC versus N_clusters using the squared Euclidean distance]

If I use the Euclidean distance instead, I get the expected results. This is the formula I have been using, because in the tests I have run it seems correct.

Using the appropriate Euclidean distance gives me this plot:

[Plot: BIC versus N_clusters using the Euclidean distance]

Here we can clearly see that the appropriate number of clusters to pick is 3 (Setosa, Versicolor and Virginica).
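To make the comparison between the two distance choices concrete, here is a minimal sketch of this kind of BIC computation. It assumes an X-means-style spherical-Gaussian model with one variance shared by all clusters, which may differ in detail from the linked code. The function name `kmeans_bic`, the `squared` flag and the toy points are hypothetical and only serve as illustration; this is not the iris data.

```python
import math


def kmeans_bic(points, labels, centers, squared=True):
    """BIC (higher is better) of a hard clustering.

    Assumption: spherical Gaussian clusters sharing one variance
    (X-means style). `squared` switches between summing squared
    Euclidean distances and plain Euclidean distances in the
    variance estimate, the two variants compared in this answer.
    """
    n_total = len(points)
    dim = len(points[0])
    k = len(centers)
    if n_total <= k:
        raise ValueError("need more points than clusters")

    # Cluster sizes and summed point-to-centre distances.
    sizes = [0] * k
    dist_sum = 0.0
    for p, lab in zip(points, labels):
        sizes[lab] += 1
        d2 = sum((a - b) ** 2 for a, b in zip(p, centers[lab]))
        dist_sum += d2 if squared else math.sqrt(d2)

    # Pooled variance estimate shared by all clusters.
    variance = dist_sum / ((n_total - k) * dim)
    if variance <= 0:
        raise ValueError("variance estimate must be positive")

    # Log-likelihood term, accumulated cluster by cluster.
    log_lik = 0.0
    for n_i in sizes:
        if n_i == 0:
            continue  # empty clusters contribute nothing
        log_lik += (
            n_i * math.log(n_i)
            - n_i * math.log(n_total)
            - n_i * dim / 2.0 * math.log(2.0 * math.pi * variance)
            - (n_i - 1) * dim / 2.0
        )

    # Penalty grows with the number of clusters.
    penalty = 0.5 * k * (dim + 1) * math.log(n_total)
    return log_lik - penalty


# Toy data (hypothetical): two well-separated square blobs in 2D.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]

# One cluster: every point assigned to the overall mean.
bic_k1 = kmeans_bic(points, [0] * 8, [(5.5, 5.5)])

# Two clusters: each blob assigned to its own mean.
bic_k2 = kmeans_bic(points, [0, 0, 0, 0, 1, 1, 1, 1],
                    [(0.5, 0.5), (10.5, 10.5)])

best_k = 2 if bic_k2 > bic_k1 else 1
print(bic_k1, bic_k2, best_k)
```

On this toy data both BIC values are negative and the two-cluster fit scores higher, so the rule "pick the highest BIC" selects 2. Passing `squared=False` gives the plain Euclidean variant, so both curves can be compared on the same clustering.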

One last note: it doesn't make sense to set your minimum n_clusters to 1; it should start at 2. I only started with 1 to make the plot look like yours.