As you have noticed, any method that requires a full distance matrix won't work.
Memory is one thing, but the other is runtime. Typical implementations of hierarchical clustering are in $O(n^3)$ (I know that ELKI has SLINK, which is an $O(n^2)$ algorithm for single-link clustering). This just does not scale to large data sets.
PAM itself should not require a complete distance matrix, but the algorithm is known to scale badly, because it needs to (re-)compute all pairwise distances within each cluster on each iteration to find the most central elements. This cost goes down if you have a large number of (hence smaller) clusters, but it remains quite expensive!
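For intuition, here is a hypothetical numpy sketch of that medoid-update step (not any particular library's implementation): for a cluster of $m$ points it evaluates all $m^2$ pairwise distances, and it does so again in every iteration.

```python
import numpy as np

def update_medoid(cluster_points):
    """Pick the point minimizing the sum of distances to all other
    points in the cluster -- an O(m^2) step repeated every iteration."""
    # all pairwise Euclidean distances within the cluster
    diffs = cluster_points[:, None, :] - cluster_points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    return cluster_points[dists.sum(axis=1).argmin()]

# toy usage: 500 points in 5 dimensions
rng = np.random.default_rng(0)
medoid = update_medoid(rng.normal(size=(500, 5)))
```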
Instead, you should look into methods that can use index structures for acceleration. With a good index, such clustering algorithms can run in $O(n \log n)$, which is much better for large data sets.
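As a hedged illustration, scikit-learn's `NearestNeighbors` (one option among many) can build a kd-tree index that answers the neighborhood queries such algorithms rely on in roughly logarithmic time for low-dimensional data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 3))   # low-dimensional data

# build a kd-tree index once; each subsequent query is cheap
index = NearestNeighbors(n_neighbors=10, algorithm="kd_tree").fit(X)
distances, neighbors = index.kneighbors(X[:5])
```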
However, for most of these algorithms, you first need to make sure your distance function is really good; then you need to consider ways to accelerate queries by using appropriate indexes.
Also note that in many cases - and this may well hold for PAM - you can run the algorithm on a sample first, then only refine it on the full data set. If your sample is representative, algorithms such as k-means and PAM should give you essentially the same result as on the complete data set.
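A minimal sketch of that sample-first strategy, assuming scikit-learn's k-means (the same idea carries over to PAM implementations):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000_000, 8))      # stand-in for the full data set

# fit on a representative sample ...
sample = X[rng.choice(len(X), size=50_000, replace=False)]
model = KMeans(n_clusters=10, n_init=10).fit(sample)

# ... then only assign (or refine) the remaining points
labels = model.predict(X)
```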
Are you sure that clustering big data is actually used anywhere?
As far as I can tell, it is not used. Everybody uses classification, nobody uses clustering, because the clustering problem is much harder and requires manual analysis of the results.
K-means: the usual Lloyd algorithm is naively parallel, and thus trivial to implement on Hadoop. But at the same time, it does not make sense to use k-means on big data. The reason is simple: there is no dense-vector big data. K-means works well for, say, up to 10 dimensions. With double precision, I need 80 bytes per record then. A modest computer with 1 GB of RAM can already fit some 13 million such vectors into main memory, and I have machines with 128 GB of RAM...
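The back-of-the-envelope arithmetic behind those numbers:

```python
dims, bytes_per_double = 10, 8
bytes_per_record = dims * bytes_per_double   # 80 bytes per 10-d vector
ram = 1 * 1024 ** 3                          # 1 GB of RAM
print(ram // bytes_per_record)               # ~13.4 million vectors
```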
So you will have a hard time coming up with a real data set where:
- I run out of memory on a single computer.
- k-means produces notable results. (On high-dimensional data, k-means is usually only as effective as random Voronoi partitions!)
- the result improves over a sample.
The last point is important: k-means computes means. The quality of a mean does not infinitely improve when you add more data. You only get marginal changes (if the result is stable, i.e. k-means worked). Most likely, your distributed computation already lost more precision on the way than you gain in the end...
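To see why, recall that the standard error of a mean shrinks only with $\sqrt{n}$: 100 times more data buys you only 10 times less noise. A toy illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=5.0, scale=2.0, size=10_000_000)

for n in (1_000, 100_000, 10_000_000):
    sample_mean = population[:n].mean()
    # standard error of the mean: scale / sqrt(n)
    print(n, sample_mean, 2.0 / np.sqrt(n))
```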
Now for DBSCAN: I'm not aware of a popular distributed implementation. Every now and then a new parallel DBSCAN is proposed, usually using grids, but I've never seen one used in practice or publicly available. Again, there are problems with the availability of interesting data where it would make sense to use DBSCAN.
- For big data, how do you set the minPts and epsilon parameters? If you get this wrong, you won't have any clusters, or everything will become a single large cluster. (A sketch of the usual parameter heuristic follows after this list.)
- If your data is low-dimensional, see above for k-means. Using techniques such as R*-trees and grids, a single computer can already cluster low-dimensional data with billions of points using DBSCAN.
- If you have complex data, where indexing no longer works, DBSCAN will scale quadratically and thus be an inappropriate choice for big data.
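A common heuristic for those parameters is to sort the distances to the minPts-th nearest neighbor (the "k-distance plot") and pick epsilon near the knee. A sketch with scikit-learn, where the tree index also keeps the low-dimensional case tractable (the 95th-percentile "knee" guess is only a placeholder; in practice you would inspect the curve):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 3))

# k-distance plot: distance to the minPts-th neighbor, sorted
min_pts = 5
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_distances = np.sort(dists[:, -1])

eps = k_distances[int(0.95 * len(k_distances))]   # crude knee guess
labels = DBSCAN(eps=eps, min_samples=min_pts, algorithm="kd_tree").fit_predict(X)
```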
Many platforms/companies like to pretend they can reasonably run k-means on their cluster. But the fact is, it does not make sense this way; it's just marketing and a tech demo. That is why they usually use random data to show off, or the dreaded broken KDDCup1999 data set (which I can still cluster faster on a single computer than on any Hadoop cluster!).
So what is really done in practice:
- The Hadoop cluster is your data warehouse (rebranded as fancy new big data).
- You run distributed preprocessing on your raw data, to massage it into shape.
- The preprocessed data is small enough to be clustered on a single computer, with more advanced algorithms (which may even scale quadratically and do not have to be naively parallel).
- You sell it to your marketing department.
- Your marketing department sells it to the CSomethingO.
- Everybody is happy, because they are now big data experts.
Best Answer
First you need to be clear on what you need. Often clustering is not that interesting once you've understood what it actually does...
I'd assume that you first need to prepare the data, for example aggregate it in a more interesting way. That will likely give you more attributes, and much fewer instances.
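For instance, aggregating raw sales transactions per customer with pandas typically yields a much smaller table of more informative features (the file and column names here are hypothetical):

```python
import pandas as pd

# raw transactions: one row per purchased item (hypothetical schema)
transactions = pd.read_csv("sales.csv")   # columns: customer_id, category, amount

features = transactions.groupby("customer_id").agg(
    n_purchases=("amount", "size"),
    total_spent=("amount", "sum"),
    n_categories=("category", "nunique"),
)
# 'features' has one row per customer: far fewer instances, richer attributes
```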
Then, on sales data, frequent itemsets and association rules are often much more interesting than clusters.
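A hedged sketch with the mlxtend library (one possible choice; the file and basket layout are hypothetical):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# one-hot basket matrix: rows are transactions, columns are items (True/False)
baskets = pd.read_csv("baskets_onehot.csv").astype(bool)

itemsets = apriori(baskets, min_support=0.01, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]].head())
```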
But your data has just one categorical variable... Split your data by it; then you can use k-means, which will (as long as you use a good algorithm and not Lloyd's) easily scale to the entire data set. But since it is k-means, a few thousand data points will be enough; larger data only yields diminishing returns.
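A sketch of that split-then-cluster approach, assuming scikit-learn's k-means and a hypothetical prepared data file:

```python
import pandas as pd
from sklearn.cluster import KMeans

# hypothetical input: one categorical column plus numeric features
df = pd.read_csv("prepared_features.csv")

results = {}
for category, group in df.groupby("category"):
    # a few thousand points per category are plenty for k-means
    sample = group.drop(columns="category").sample(
        n=min(len(group), 5000), random_state=0)
    results[category] = KMeans(n_clusters=5, n_init=10).fit(sample)
```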
Anyway, never scale to a big data set until you know that your approach works. It would be a waste of time and resources to compute something on the big data only to find out that it's not what you wanted in the first place. First use a sample to understand the problem and the solution; then work on scaling the solution to the entire data set.