See, even hierarchical clustering needs parameters if you want to get a partitioning out. In fact, hierarchical clustering has (roughly) four parameters: 1. the actual algorithm (divisive vs. agglomerative), 2. the distance function, 3. the linkage criterion (single-link, ward, etc.) and 4. the distance threshold at which you cut the tree (or any other extraction method).
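To make those four choices concrete, here is a minimal sketch using SciPy (assumed available); the toy data and the cut threshold are made-up illustration values:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy data, purely illustrative

# 1. algorithm: SciPy's linkage() is agglomerative (bottom-up)
# 2. distance function: Euclidean here
# 3. linkage criterion: Ward
Z = linkage(X, method="ward", metric="euclidean")

# 4. the cut: a distance threshold turns the dendrogram into a flat partitioning
labels = fcluster(Z, t=5.0, criterion="distance")
print(len(set(labels)), "clusters")
```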
The fact is that there is no good "push button" solution to cluster analysis. It is an exploratory technique: you have to try different methods and parameters and analyze the results.
I found DBSCAN to be very usable in most cases. Yes, it has two parameters (the distance threshold, which defines the neighborhood predicate, and minpts, which defines the core-point predicate). I'm not counting the distance function separately this time, because all that is really needed is a binary "is neighbor of" predicate; see GDBSCAN.
The reason is that in many applications you can choose these values intuitively, provided you have understood your data well enough. For example, when working with geo data, distance is literally in kilometers, which lets me specify the spatial resolution intuitively.
Similarly, minpts gives me an intuitive control over how "significant" a subset of observations needs to be before it becomes a cluster.
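As a hedged sketch of that intuition with scikit-learn's DBSCAN: the haversine metric makes eps a true distance on the Earth's surface, so it can be stated in kilometers. The coordinates and the 1 km radius below are invented illustration values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

coords_deg = np.array([[52.520, 13.405],   # hypothetical lat/lon points
                       [52.521, 13.406],
                       [48.137, 11.575]])
coords_rad = np.radians(coords_deg)        # haversine expects radians

earth_radius_km = 6371.0
eps_km = 1.0                               # spatial resolution: 1 km
db = DBSCAN(eps=eps_km / earth_radius_km,  # convert km to radians
            min_samples=2,                 # minpts: required "significance"
            metric="haversine").fit(coords_rad)
print(db.labels_)                          # -1 marks noise points
```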
Usually, when you find DBSCAN hard to use, it is because you have not yet understood "distance" on your data. You first need to figure out how to measure distance and what the resulting numbers mean to you; then you will know which threshold to use.
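One common way to build that understanding is a k-distance plot: sort every point's distance to its k-th nearest neighbor and look for a knee. A small sketch, assuming scikit-learn and a placeholder data matrix X:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))              # stand-in for your data

k = 4                                      # often chosen as minpts - 1
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)                 # dist[:, 0] is the point itself
kdist = np.sort(dist[:, k])                # sorted k-distances
print(kdist[-20:])                         # the tail: candidate eps values
```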
And in the end, go and try out stuff. It's data exploration, not "return(truth);". There is no "true" clustering. There are only "obvious", "useless" and "interesting" clusterings, and these qualities cannot be measured mathematically; they are subjective to the user.
Choosing DBSCAN parameters that would suit all of your data sets will likely not work. Plus, you said that clusters could overlap.
Maybe you should look into the follow-up algorithms (DBSCAN is 20 years old), such as OPTICS and HDBSCAN*, to see whether they suit your problem better. With 100 dimensions, subspace approaches (which often allow overlapping clusters) are worth looking at, too.
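For instance, OPTICS ships with scikit-learn (recent scikit-learn versions also include an HDBSCAN implementation); a brief sketch on toy data:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

# OPTICS replaces the fixed epsilon with an ordering over reachability
# distances, so essentially only min_samples needs to be chosen.
labels = OPTICS(min_samples=5).fit_predict(X)
print(set(labels))                         # -1 again marks noise
```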
Best Answer
Are you sure that clustering big data is actually used anywhere?
As far as I can tell, it is not. Everybody uses classification; nobody uses clustering, because the clustering problem is much harder and requires manual analysis of the results.
K-means: the usual Lloyd algorithm is naively parallel, and thus trivial to implement on Hadoop. But at the same time, it does not make sense to use k-means on big data. The reason is simple: there is no dense-vector big data. K-means works well for, say, up to 10 dimensions. With double precision, that is 80 bytes per record. A modest computer with 1 GB of RAM can already fit some 13 million such vectors into main memory, and I have machines with 128 GB of RAM...
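A quick back-of-the-envelope check of those numbers (pure arithmetic, no assumptions beyond 10 dimensions at 8 bytes each):

```python
bytes_per_record = 10 * 8                  # 10 dims x 8 bytes (float64)
for ram_gb in (1, 128):
    records = ram_gb * 2**30 // bytes_per_record
    print(f"{ram_gb:>3} GB RAM -> ~{records / 1e6:,.0f} million vectors")
#   1 GB RAM -> ~13 million vectors
# 128 GB RAM -> ~1,718 million vectors
```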
So you will have a hard time coming up with a real data set where:
- the data really is dense vectors (so k-means applies at all),
- it is too large for a single machine's main memory,
- and adding more data would actually improve the k-means result.
The last point is important: k-means computes means. The quality of a mean does not infinitely improve when you add more data. You only get marginal changes (if the result is stable, i.e. k-means worked). Most likely, your distributed computation already lost more precision on the way than you gain in the end...
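To see why, recall that the standard error of a mean shrinks only like 1/sqrt(n), so a thousandfold increase in data buys only a factor of about 30 in precision. A toy demonstration on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000_000)
for n in (1_000, 100_000, 10_000_000):
    # error of the sample mean vs. the true mean of 5.0
    print(n, abs(data[:n].mean() - 5.0))   # keeps shrinking, but slowly
```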
Now for DBSCAN: I'm not aware of a popular distributed implementation. Every now and then a new parallel DBSCAN is proposed, usually grid-based, but I've never seen one used in practice or made publicly available. Again, there are problems with the availability of interesting data where it would make sense to use DBSCAN.
Plus, how do you choose the minPts and epsilon parameters? If you get this wrong, you won't have any clusters, or everything will be a single large cluster.

Many platforms/companies like to pretend they can reasonably run k-means on their cluster. But the fact is, it does not make sense this way; it's just marketing and tech demos. That is why they usually use random data to show off, or the dreaded broken KDDCup1999 data set (which I can still cluster faster on a single computer than on any Hadoop cluster!).
So what is really done in practice?