Solved – How to summarize and understand the results of DBSCAN clustering on big data

clustering, data mining, interpretation, large data

Many clustering algorithms can be used with big data, e.g., versions of k-means or DBSCAN built on Hadoop. With k-means we get k centroids for k clusters, which we can map into the space and interpret to some degree. But what about density-based algorithms like DBSCAN? With big data, DBSCAN gives us m clusters containing millions or billions of data points.

How do we try to understand the results of such clusterings?

Clustering more data is costly and also meaningless if we don't understand each cluster.

Best Answer

Are you sure that clustering big data is actually used anywhere?

As far as I can tell, it is not. Everybody uses classification, nobody uses clustering, because the clustering problem is much harder and requires manual analysis of the results.

K-means: the usual Lloyd algorithm is naively parallel, and thus trivial to implement on Hadoop. But at the same time, it does not make sense to use k-means on big data. The reason is simple: there is no dense-vector big data. K-means works well for, say, up to 10 dimensions. With double precision, that is 80 bytes per record. A modest computer with 1 GB of RAM can then already fit some 13 million vectors into main memory, and I have machines with 128 GB of RAM...
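
To make that arithmetic concrete, here is a minimal sketch (NumPy and scikit-learn are my choices here; the answer does not prescribe any library). It reproduces the record counts above and then runs plain Lloyd-style k-means on a few million 10-dimensional points on a single machine:

```python
import numpy as np
from sklearn.cluster import KMeans

# How many dense double-precision 10-d vectors fit in RAM?
bytes_per_record = 10 * 8                 # 10 dimensions x 8 bytes
print(1 * 1024**3 // bytes_per_record)    # ~13.4 million records in 1 GB
print(128 * 1024**3 // bytes_per_record)  # ~1.7 billion records in 128 GB

# A few million points are no problem for single-machine k-means.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000_000, 10))      # ~400 MB of dense doubles
km = KMeans(n_clusters=10, n_init=1).fit(X)
print(km.cluster_centers_.shape)          # (10, 10)
```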

So you will have a hard time coming up with a real data set where:

  • I run out of memory on a single computer.
  • k-means produces notable results. (On high-dimensional data, k-means is usually only as effective as random Voronoi partitions!)
  • the result improves over a sample.

The last point is important: k-means computes means. The quality of a mean does not improve indefinitely as you add more data; you only get marginal changes (if the result is stable, i.e. if k-means worked). Most likely, your distributed computation has already lost more precision along the way than you gain in the end...
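
A quick way to see those diminishing returns (my own hedged illustration, not part of the original answer): the standard error of a mean shrinks like 1/sqrt(n), so a modest sample already pins the mean down about as well as the full data set.

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(loc=3.0, scale=1.0, size=10_000_000)

# A 1% subsample: its mean differs from the "all data" mean by roughly
# sigma / sqrt(100_000), i.e. somewhere in the third decimal place.
sample = rng.choice(full, size=100_000, replace=False)
print(full.mean(), sample.mean(), abs(full.mean() - sample.mean()))
```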

Now for DBSCAN: I'm not aware of a popular distributed implementation. Every now and then a new parallel DBSCAN is proposed, usually grid-based, but I've never seen one used in practice or made publicly available. Again, there are problems with the availability of interesting data where it would make sense to use DBSCAN.

  • For big data, how do you set the minPts and epsilon parameters? If you get this wrong, you won't have any clusters, or everything will be one single large cluster (see the sketch after this list).
  • If your data is low-dimensional, see above for k-means. Using techniques such as R*-trees and grids, a single computer can already cluster low-dimensional data with billions of points using DBSCAN.
  • If you have complex data, where indexing no longer works, DBSCAN will scale quadratically and thus be an inappropriate choice for big data.
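
To illustrate the first two points, here is a hedged sketch with scikit-learn on low-dimensional data: the library, the k-distance heuristic for choosing epsilon, and all of the numbers are my assumptions, not something the answer specifies.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Low-dimensional toy data: two dense Gaussian blobs plus uniform noise.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100_000, 3)),
    rng.normal(loc=5.0, scale=0.3, size=(100_000, 3)),
    rng.uniform(low=-2.0, high=7.0, size=(20_000, 3)),
])

# Common heuristic for epsilon: sort the distance to the minPts-th nearest
# neighbour and pick a value near the "knee" of that curve.
min_pts = 10
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
kth_dist = np.sort(nn.kneighbors(X)[0][:, -1])
eps = 0.2  # chosen by inspecting kth_dist; hard-coded for this sketch

# A spatial index (a kd-tree here) keeps the range queries cheap in low
# dimensions, so a single machine handles this comfortably.
labels = DBSCAN(eps=eps, min_samples=min_pts, algorithm="kd_tree").fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int((labels == -1).sum()))
```

With a badly chosen epsilon the same code degenerates exactly as described: far too small and everything becomes noise, far too large and the blobs merge with the noise into one cluster.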

Many platforms/companies like to pretend they can reasonably run k-means on their cluster. But the fact is, it does not make sense this way; it is just marketing and tech demos. That is why they usually show off on random data, or on the dreaded broken KDDCup1999 data set (which I can still cluster faster on a single computer than on any Hadoop cluster!).

So what is really done in practice

  • The Hadoop cluster is your data warehouse (rebranded as fancy new big data).
  • You run distributed preprocessing on your raw data, to massage it into shape.
  • The preprocessed data is small enough to be clustered on a single computer, with more advanced algorithms (that may even scale quadratically and do not have to be naively parallel); see the sketch after this list.
  • You sell it to your marketing department.
  • Your marketing department sells it to the CSomethingO.
  • Everybody is happy, because they are now big data experts.
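
As a rough single-machine stand-in for that workflow (pandas and scikit-learn are my choices; in reality the aggregation step is what would run on the Hadoop cluster, and every column name and parameter here is invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in for the raw "big data": one row per event.  In practice this
# aggregation is the distributed preprocessing job on the cluster.
rng = np.random.default_rng(0)
events = pd.DataFrame({
    "user_id": rng.integers(0, 50_000, size=5_000_000),
    "amount": rng.exponential(scale=20.0, size=5_000_000),
})

# Massage millions of raw events into one small feature row per user --
# now small enough for a single computer.
features = events.groupby("user_id")["amount"].agg(["count", "mean", "sum"])

# Cluster the reduced table locally with whatever algorithm fits;
# no distributed clustering implementation is needed.
X = StandardScaler().fit_transform(features)
features["cluster"] = KMeans(n_clusters=5, n_init=5).fit_predict(X)
print(features.groupby("cluster").mean())  # per-cluster profile for reporting
```

The per-cluster profile at the end is also the kind of summary the original question asks for: you understand each cluster through a handful of aggregated features rather than by inspecting millions of raw points.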