Solved – Cluster Big Data in R and Is Sampling Relevant

Tags: clustering, large data, r, sampling

I'm new to data science and have a problem finding clusters in a data set with 200,000 rows and 50 columns in R.

Since the data have both numeric and nominal variables, methods like k-means, which use a Euclidean distance measure, don't seem like an appropriate choice. So I turned to PAM, agnes and hclust, which accept a distance matrix as input.

The daisy function can work on mixed-type data, but the resulting distance matrix is just too big: 200,000 × 200,000 is much larger than 2^31 - 1 (the vector length limit before R 3.0.0).
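To make the setup concrete, here is roughly what the daisy call looks like on a tiny mixed-type toy data frame (the columns are made up, just to show the call):

```r
library(cluster)

## Toy mixed-type data frame (made-up columns, just to show the call)
toy <- data.frame(
  income  = c(52000, 61000, 38000, 45000),                 # numeric
  region  = factor(c("north", "south", "south", "west")),  # nominal
  married = factor(c("yes", "no", "yes", "no"))            # nominal
)

## With mixed column types, daisy() uses the Gower coefficient
d <- daisy(toy, metric = "gower")
d   # a 'dist' object holding choose(4, 2) = 6 pairwise dissimilarities
```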

The new R 3.0.0, released yesterday, supports long vectors with more than 2^31 - 1 elements. But a double matrix of 200,000 × 200,000 requires around 300 GB of contiguous RAM, which is not possible on my 16 GB machine.
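The back-of-the-envelope arithmetic:

```r
n <- 200000
n * n * 8 / 2^30            # full double matrix: ~298 GiB
n * (n - 1) / 2 * 8 / 2^30  # a 'dist' object (lower triangle only): ~149 GiB
```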

I have read about parallel computing and the bigmemory package, and I am not sure whether they would help: if I use daisy, it will generate a big matrix that cannot fit in memory anyway.

I also read the post about sampling:
Is sampling relevant in the time of 'big data'?

So in my case, is it reasonable to sample the data set, cluster the sample, and then infer the structure of the whole data set from that?

Can you please give me some suggestions? Thank you!

About my machine:

R version 3.0.0 (2013-04-03)

Platform: x86_64-w64-mingw32/x64 (64-bit)

OS: Windows 7 64bit

RAM: 16.0GB

Best Answer

As you have noticed, any method that requires a full distance matrix won't work. Memory is one issue, but runtime is the other. Typical implementations of hierarchical clustering run in $O(n^3)$ (I know that ELKI has SLINK, an $O(n^2)$ algorithm for single-link clustering). This just does not scale to large data sets.

PAM itself should not require a complete distance matrix, but the algorithm is known to scale badly, because on each iteration it needs to (re-)compute all pairwise distances within each cluster to find the most central elements. This cost is lower if you have a large number of clusters (each cluster is then smaller), but it is nevertheless quite expensive!

Instead, you should look into methods that can use index structures for acceleration. With a good index, such clustering algorithms can run in $O(n \log n)$, which is much better for large data sets.
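For illustration only: the dbscan package in R accelerates its neighbourhood queries with a kd-tree, so it never materialises an n-by-n distance matrix. This applies only to numeric data, so in your case you would have to restrict it to the numeric columns or encode the nominal ones first; the data frame name and the eps/minPts values below are placeholders you would have to tune.

```r
library(dbscan)

## 'my_data' is a placeholder for the full data frame; the kd-tree index
## only handles numeric columns, so keep and scale those (or encode the
## nominal variables numerically first).
x_num <- scale(my_data[, sapply(my_data, is.numeric)])

kNNdistplot(x_num, k = 5)                     # look for the "elbow" to pick eps
res <- dbscan(x_num, eps = 0.5, minPts = 5)   # eps/minPts are placeholders
table(res$cluster)                            # cluster 0 collects noise points
```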

However, for most of these algorithms, you first need to make sure your distance function is really good; then you need to consider ways to accelerate queries by using appropriate indexes.

Also note that in many cases (and this may well hold for PAM) you can run the algorithm on a sample first, then only refine it on the full data set. If your sample is representative, algorithms such as k-means and PAM should give you essentially the same result as on the complete data set.
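A minimal sketch of that sample-then-extend idea with daisy and pam follows; k, the sample size, the chunk size and the object name my_data are assumptions, and the assignment step is only approximate because the Gower ranges are recomputed per chunk.

```r
library(cluster)
set.seed(42)

k      <- 10                              # assumed number of clusters
n_samp <- 5000                            # sample size that daisy/pam can handle
idx    <- sample(nrow(my_data), n_samp)   # 'my_data' is the full 200,000-row frame

## 1. Cluster a manageable sample on the Gower dissimilarity
d_samp      <- daisy(my_data[idx, ], metric = "gower")
fit         <- pam(d_samp, k = k, diss = TRUE)
medoid_rows <- my_data[idx[fit$id.med], ]         # the k medoid observations

## 2. Assign every remaining row to its nearest medoid in chunks,
##    so the full n-by-n matrix is never built.
##    Caveat: Gower ranges are recomputed per chunk, so this is approximate.
assign_chunk <- function(chunk) {
  dd <- as.matrix(daisy(rbind(medoid_rows, chunk), metric = "gower"))
  dd <- dd[-(1:k), 1:k, drop = FALSE]             # rows = chunk, cols = medoids
  max.col(-dd)                                    # index of the closest medoid
}

chunks <- split(seq_len(nrow(my_data)), ceiling(seq_len(nrow(my_data)) / 10000))
labels <- unlist(lapply(chunks, function(i) assign_chunk(my_data[i, ])))
```

The cluster package's clara() implements a similar sample-based k-medoids scheme out of the box, but only for numeric data with Euclidean or Manhattan distance, so it would require encoding the nominal columns first.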