Solved – Hybrid (K-means + Hierarchical) clustering

clustering, k-means, large data, machine learning, sparse

I have a huge dataset (50,000 sparse feature vectors, each 2000-dimensional). I want to cluster them into k (unknown) clusters. Since hierarchical clustering is very expensive in terms of time complexity (though it gives better results), I have designed my clustering framework as follows:

  1. Run k-means to partition the data into a large number of bins (k is unknown, so I pick it reasonably large, e.g. k = 500).
  2. Take the centroids of all 500 partitions.
  3. Run hierarchical clustering on those 500 centroids (a kind of merging based on some distance threshold t).
  4. Assign each data point to the nearest of the centroids that emerge from the hierarchical clustering.

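The four steps above can be sketched with scikit-learn. This is only an illustrative sketch, not the asker's actual code: the data here is random toy data, the sizes (1000 × 200 instead of 50,000 × 2000), the coarse k = 50, and the merge threshold are all assumptions chosen so the example runs quickly.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.cluster import KMeans, AgglomerativeClustering

# Toy stand-in for the real data: 1000 sparse 200-dim vectors
# (the question uses 50,000 x 2000; smaller sizes are assumed here for speed).
X = sparse_random(1000, 200, density=0.05, random_state=0, format="csr")

# Steps 1-2: coarse k-means with a deliberately large k (50 here, 500 in the question).
coarse = KMeans(n_clusters=50, n_init=3, random_state=0).fit(X)
centroids = coarse.cluster_centers_

# Step 3: hierarchical clustering of the centroids, cut by a distance threshold t
# (t = 1.0 is an arbitrary assumed value; it must be tuned on real data).
merge = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0, linkage="average"
).fit(centroids)

# Step 4: each point inherits the merged cluster of its coarse centroid.
labels = merge.labels_[coarse.labels_]
```

Note that step 4 here reuses the k-means assignment rather than re-computing nearest merged centroids; re-assigning against the merged centroids directly is a straightforward variation.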
I would like to know whether my approach is efficient, and whether there is a better solution to this problem.

Thank you.

Best Answer

Why not first look at other clustering algorithms? For example, DBSCAN does not need to know the number of clusters in advance, and it can work with an arbitrary distance function (assuming that you want to go with something like cosine distance for your sparse vectors).
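A minimal DBSCAN sketch on sparse data with a cosine metric might look like this. The `eps` and `min_samples` values are assumptions for illustration only; on real data they need tuning, and points that fall in no cluster get the noise label -1.

```python
from scipy.sparse import random as sparse_random
from sklearn.cluster import DBSCAN

# Toy sparse data standing in for the real vectors.
X = sparse_random(500, 100, density=0.1, random_state=0, format="csr")

# No number of clusters needed: DBSCAN discovers it from density.
# eps (cosine-distance radius) and min_samples are assumed, tunable values.
db = DBSCAN(eps=0.5, min_samples=5, metric="cosine").fit(X)
labels = db.labels_  # -1 marks noise points
```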

Given the characteristics of your data, k-means is bound to fail. The problem is that the means will most likely no longer be sparse, so they are effectively outliers rather than central objects. Don't use k-means with sparse vectors or non-Euclidean distances! (k-means may not converge for other distance functions, because the mean may no longer minimize the objective function.)
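The densification effect is easy to demonstrate numerically: averaging many sparse vectors fills in almost every coordinate that is nonzero in any of them. The sizes below are arbitrary toy values.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# 100 sparse rows, each with ~1% nonzero entries.
X = sparse_random(100, 1000, density=0.01, random_state=0, format="csr")

# Density of a typical row vs. density of the mean of all rows.
row_density = X.getnnz(axis=1).mean() / X.shape[1]
mean = np.asarray(X.mean(axis=0)).ravel()
mean_density = float((mean != 0).mean())
```

With 100 rows at 1% density, roughly 1 - 0.99^100 ≈ 63% of the mean's coordinates end up nonzero, so the "centroid" looks nothing like any actual data point.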

Seriously, there are at least 100 more modern clustering algorithms. Try these first.
