## Are you sure that *clustering big data* is actually used anywhere?

As far as I can tell, it is *not* used. Everybody uses classification, nobody uses clustering. Because the clustering problem is much harder, and will require **manual analysis** of the results.

K-means: the usual Lloyd algorithm is naive parallel, and thus trivial to implement on Hadoop. But at the same time, **it does not make sense to use k-means on big data**. The reason is simple: **there is no dense vector big data**. K-means works well for say up to 10 dimensions. With double precision, I need 80 bytes per record then. A modest computer with 1 GB of RAM can then already fit some 13 million vectors into main memory. I have machines with 128 GB of RAM...
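The arithmetic above is easy to check (assuming 10 double-precision dimensions per record and ignoring container overhead):

```python
# 10 dimensions x 8 bytes (double precision) per record, no overhead assumed
bytes_per_record = 10 * 8

# how many such records fit in 1 GB and in 128 GB of RAM
records_in_1_gb = (1 << 30) // bytes_per_record
records_in_128_gb = (128 << 30) // bytes_per_record

print(records_in_1_gb)    # roughly 13.4 million
print(records_in_128_gb)  # roughly 1.7 billion
```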

So you will have a hard time coming up with a real data set where:

- I run out of memory on a single computer.
- k-means produces notable results. (On high dimensional data, k-means is usually only as effective as random Voronoi partitions!)
- the result improves over a sample.

The last point is important: k-means computes *means*. The quality of a *mean* does not infinitely improve when you add more data. You only get *marginal* changes (if the result is stable, i.e. k-means worked). Most likely, your distributed computation already lost more precision on the way than you gain in the end...
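A quick numeric sketch of that last point, on made-up data: the mean of a 10,000-point sample is already so close to the mean of a million points that the extra data buys almost nothing.

```python
import numpy as np

rng = np.random.default_rng(42)
full = rng.normal(loc=5.0, scale=2.0, size=1_000_000)
sample = rng.choice(full, size=10_000, replace=False)

# the estimate barely moves when you go from 10k points to 1M points
print(full.mean(), sample.mean())
```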

Now for DBSCAN: I'm not aware of a *popular* distributed implementation. Every now and then a new parallel DBSCAN is proposed, usually using grids, but I've never seen one being used in practice or publicly available. Again, there are problems with the availability of interesting data where it would make *sense* to use DBSCAN.

- For big data, how do you set the `minPts` and `epsilon` parameters? If you get this wrong, you won't have any clusters; or everything will be a single large cluster.
- If your data is low-dimensional, see above for k-means. Using techniques such as R*-trees and
*grids*, a single computer can already cluster low-dimensional data with *billions* of points using DBSCAN.
- If you have complex data, where indexing no longer works, DBSCAN will scale quadratically and thus be an inappropriate choice for big data.
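To make the parameter-sensitivity point concrete, here is a deliberately naive DBSCAN sketch. Note the full O(n²) distance matrix, which is exactly why this does not scale without an index; the toy blobs and the `eps` values tried are made up for illustration.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Naive DBSCAN: O(n^2) neighbor search, no spatial index."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)          # -1 means noise
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        # only an unvisited core point can start a new cluster
        if visited[i] or len(neighbors[i]) < min_pts:
            continue
        visited[i] = True
        labels[i] = cluster
        seeds = list(neighbors[i])
        while seeds:                 # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == -1:      # claim noise/unlabeled points
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    seeds.extend(neighbors[j])
        cluster += 1
    return labels

# two tight, well-separated toy blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
               rng.normal([10, 10], 0.5, size=(50, 2))])

for eps in (0.001, 2.0, 100.0):
    labels = dbscan(X, eps=eps, min_pts=4)
    print(eps, len(set(labels) - {-1}))  # too small: all noise; too big: one blob
```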

Many platforms/companies like to pretend they can reasonably run k-means on their cluster. But the fact is, it does not make sense this way; it's just marketing and a tech demo. That is why they usually use random data to show off, or the dreaded broken KDDCup1999 data set (which I can still cluster faster on a single computer than on any Hadoop cluster!).

## So what is really done in practice

- The Hadoop cluster is your data warehouse (rebranded as fancy new big data).
- You run distributed preprocessing on your raw data, to massage it into shape.
- The preprocessed data is small enough to be clustered on a single computer, with more advanced algorithms (that may even scale quadratically, and do not have to be naively parallel).
- You sell the result to your marketing department.
- Your marketing department sells it to the CSomethingO.
- Everybody is happy, because they are now big data experts.

Choosing DBSCAN parameters that would suit all of your data sets will likely not work. Plus, you said that clusters could overlap.

Maybe you should look into the follow-up algorithms (DBSCAN is 20 years old), such as OPTICS and HDBSCAN*, to see if they better suit your problem. With 100 dimensions, subspace approaches (which often allow overlapping clusters) are worth looking at, too.

## Best Answer

The K-means algorithm and the EM algorithm are going to be pretty similar for 1D clustering.

In K-means you start with a guess of where the means are and assign each point to the cluster with the closest mean; then you recompute the means (and variances) based on the current assignments, then update the assignment of points, then update the means, ...
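In 1D that loop is only a few lines. A minimal sketch in pure NumPy, on made-up two-blob data (the init-by-sampling and the fixed iteration count are my simplifications):

```python
import numpy as np

def kmeans_1d(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)   # initial guess
    for _ in range(iters):
        # hard assignment: each point goes to the single closest mean
        assign = np.abs(x[:, None] - means[None, :]).argmin(axis=1)
        # recompute each mean from its currently assigned points
        # (keep the old mean if a cluster happens to be empty)
        means = np.array([x[assign == j].mean() if np.any(assign == j)
                          else means[j] for j in range(k)])
    return means, assign

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(8, 1, 200)])
means, assign = kmeans_1d(x, k=2)
print(np.sort(means))   # should land near the true centers, 0 and 8
```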

In EM you would also start with a guess of where the means are, then compute the expected value of the assignments (essentially the probability of each point being in each cluster), then update the estimated means (and variances) using the expected values as weights, then compute new expected values, then new means, ...

The primary difference is that the assignment of points to clusters in K-means is all-or-nothing, where EM gives proportions/probability of group membership (one point may be seen as having 80% probability of being in group A, 18% probability of being in group B, and 2% probability of being in group C). If there is a lot of separation between the groups then the two methods are going to give pretty similar results. But if there is a fair amount of overlap then EM will probably give more meaningful results (even more so if the variance/standard deviation is of interest). But if all you care about is assigning group membership without caring about the parameters, then K-means is probably simpler.

Why not do both and see how different the answers are? If they are similar, go with the simpler one; if they are different, decide by comparing the groupings to the data and outside knowledge.