Solved – Best BIC value for K-means clusters

bic, clustering, k-means

I am using the code from "Using BIC to estimate the number of k in KMEANS" (answer by Prabhath Nanisetty) to compute BIC values for K-means with different numbers of clusters. However, on the iris dataset I get the following results:

N_clusters    BIC
1             -863.896405
2             -674.133038
3             -616.557809
4             -603.357368
5             -582.428798
6             -596.073710
7             -590.086212
8             -579.876476
9             -554.665433

This is shown in the following plot:

The plot after standardization of the data:

Is it normal to have negative values for BIC? Which is the best number of clusters by BIC here, especially considering that the iris dataset has 3 groups? The most negative value in the list above is for 1 cluster.

Best Answer

I also use the code from the link you provided.

First, it is normal to have negative BIC values. Because this code computes BIC = likelihood - penalty, you want the highest value: in your first image we would clearly pick N_clusters = 8, and in the second image N_clusters = 9.

I get almost the same result if I use the squared Euclidean distance:

If I use the (unsquared) Euclidean distance, I get the expected results. This is the formula I have been using because, in the tests I have run, it seems correct.

Using the Euclidean distance gives me this plot:

Here we can clearly see that the appropriate number of clusters to pick is 3 (Setosa, Versicolor and Virginica).

One last note: it doesn't make sense to set your minimum n_clusters to 1; it should start at 2. I only started at 1 to make the plot look like yours.

Related Solutions

Solved – Using the stats package in R for kmeans clustering

I did not fully grasp question 1, but I'll attempt an answer. The plot in Q1 shows how the within-cluster sum of squares (WSS) changes as the number of clusters changes. In this kind of plot you look for kinks in the curve; a kink at 5 suggests that 5 clusters is a good choice.
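One rough way to locate such a kink numerically is to look for the largest second difference of the WSS curve. The sketch below is only a heuristic, not something the plot in the question was produced with, and the `wss` values are made-up illustrative numbers chosen so that the bend sits at 5 clusters.

```python
import numpy as np

# Candidate numbers of clusters and illustrative (made-up) WSS values.
ks = np.arange(1, 9)
wss = np.array([100.0, 80.0, 60.0, 40.0, 20.0, 18.0, 16.0, 14.0])

# Discrete second difference: large where the curve bends sharply.
# It is only defined for the interior points ks[1:-1].
second_diff = wss[:-2] - 2 * wss[1:-1] + wss[2:]

# The k with the sharpest bend is the candidate "kink".
kink_k = ks[1:-1][np.argmax(second_diff)]
print(kink_k)  # 5
```

On real data the curve is rarely this clean, so the result is best treated as a suggestion to check against the plot.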

WSS relates to your variables in the following sense. The formula for WSS is

$\sum_{j} \sum_{x_i \in C_j} \|x_i - \mu_j\|^2$

where $\mu_j$ is the mean point of cluster $j$, $x_i$ is the $i$-th observation, and $C_j$ denotes cluster $j$. WSS is sometimes interpreted as "how similar are the points inside each cluster", where similarity is measured on the variables.
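As a minimal sketch, the WSS formula can be evaluated directly with NumPy. The function name `within_ss` and the four toy points are assumptions made for illustration.

```python
import numpy as np

def within_ss(X, labels):
    """Sum over clusters of squared distances to the cluster mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]          # points x_i in cluster C_j
        mu_j = members.mean(axis=0)       # mean point of cluster j
        total += np.sum((members - mu_j) ** 2)
    return total

# Toy data (hypothetical): two pairs of points far apart.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])

wss_two = within_ss(X, np.array([0, 0, 1, 1]))  # 10.0
wss_one = within_ss(X, np.array([0, 0, 0, 0]))  # 111.0
print(wss_two, wss_one)
```

Splitting the points into their two natural groups lowers WSS from 111 to 10, which is the kind of drop the WSS plot makes visible.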

The answer to question 2 is this: what you are seeing in clusplot() is a plot of your observations in the principal plane. The function computes the principal component scores of each observation, plots those scores, and colors them by cluster.

Principal component analysis (PCA) is a dimension reduction technique; it "summarizes" the information in all variables into a couple of "new" variables called components. Each component explains a certain percentage of the total variability. In the example you read "These two components explain 73.95% of the total variability".
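To make the idea of scores and explained variability concrete, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is written in Python rather than R, it is not the code behind clusplot(), and the function name `pca_scores` and the random data are assumptions. The sketch centers the variables but does not rescale them.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return principal component scores and the share of variance each explains."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each variable
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T             # coordinates in the principal plane
    explained = s ** 2 / np.sum(s ** 2)           # fraction of total variability
    return scores, explained[:n_components]

# Hypothetical data: 30 observations of 3 variables, two of them correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(30, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(30, 1)),
               2 * base + 0.1 * rng.normal(size=(30, 1)),
               rng.normal(size=(30, 1))])

scores, explained = pca_scores(X)
print(scores.shape, explained.sum())
```

Plotting the two columns of `scores` against each other and coloring the points by cluster label gives a picture of the same kind as the one described above; `explained.sum()` plays the role of the "two components explain ..." figure.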

The function clusplot() is used to judge the effectiveness of a clustering. If the clustering is successful, the clusters are clearly separated in the principal plane. If it is unsuccessful, the clusters overlap in the principal plane.

For further reference on principal component analysis you may read the Wikipedia article. If you want a book, I suggest Modern Multivariate Statistical Techniques by Izenman; it covers both PCA and k-means.

Hope this helps :)

Using BIC – Estimating Number of Clusters in K-Means with Python

It seems you have a few errors in your formulas, as determined by comparing to:

np.sum([n[i] * np.log(n[i]) -
               n[i] * np.log(N) -
             ((n[i] * d) / 2) * np.log(2*np.pi) -
              (n[i] / 2) * np.log(cl_var[i]) -
             ((n[i] - m) / 2) for i in range(m)]) - const_term

Here there are three errors in the paper: the fourth and fifth lines are missing a factor of d, and the last line substitutes m for 1. It should be:

np.sum([n[i] * np.log(n[i]) -
               n[i] * np.log(N) -
             ((n[i] * d) / 2) * np.log(2*np.pi*cl_var) -
             ((n[i] - 1) * d/ 2) for i in range(m)]) - const_term

The const_term:

const_term = 0.5 * m * np.log(N)

should be:

const_term = 0.5 * m * np.log(N) * (d+1)

The variance formula:

cl_var = [(1.0 / (n[i] - m)) * sum(distance.cdist(p[np.where(label_ == i)], [centers[0][i]], 'euclidean')**2)  for i in range(m)]

should be a scalar:

cl_var = (1.0 / (N - m) / d) * sum([sum(distance.cdist(p[np.where(labels == i)], [centers[0][i]], 'euclidean')**2) for i in range(m)])

Use natural logs instead of your base-10 logs.

Finally, and most importantly, the BIC you are computing has the opposite sign from the regular definition, so you are looking to maximize it instead of minimize it.
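Putting the corrections above together, a minimal sketch of the whole computation might look like the following. The function name `kmeans_bic`, its arguments, and the toy data are assumptions for illustration rather than the original code. The sketch also assumes every cluster has at least one point and that `labels` are integers from 0 to m - 1.

```python
import numpy as np

def kmeans_bic(X, labels, centers):
    """BIC (higher is better) using the corrected terms listed above."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    N, d = X.shape
    m = centers.shape[0]
    n = np.bincount(labels, minlength=m)          # cluster sizes n[i]

    # Scalar pooled variance: squared distances to assigned centers.
    cl_var = np.sum((X - centers[labels]) ** 2) / (N - m) / d

    const_term = 0.5 * m * np.log(N) * (d + 1)    # penalty term

    log_lik = np.sum(n * np.log(n)
                     - n * np.log(N)
                     - (n * d / 2) * np.log(2 * np.pi * cl_var)
                     - (n - 1) * d / 2)
    return log_lik - const_term

# Hypothetical data: two well-separated blobs in 2 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(10.0, 1.0, size=(50, 2))])

# One cluster: everything assigned to the overall mean.
labels_one = np.zeros(100, dtype=int)
bic_one = kmeans_bic(X, labels_one, X.mean(axis=0, keepdims=True))

# Two clusters: the true blob membership, with the blob means as centers.
labels_two = np.repeat([0, 1], 50)
centers_two = np.vstack([X[:50].mean(axis=0), X[50:].mean(axis=0)])
bic_two = kmeans_bic(X, labels_two, centers_two)

print(bic_one, bic_two)
```

With this sign convention the two-cluster labeling scores higher than the single cluster, and both values are negative, which matches the points made earlier about negative BIC and picking the maximum.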
