Solved – NbClust with large data sets. Sampling

clustering, r, sampling

I have a data set with 3 columns and about 50,000 rows. I am trying to do some cluster analysis and would like to speed it up. Most examples online use data sets like iris, which are relatively tiny. Take the following code:

nb <- NbClust(d, distance = "euclidean", min.nc = 2, max.nc = 30, method = "complete", index = "all")

This has been running for 24 hours so far on a server with 32 cores and 100 GB of RAM, although only one core is being used.

There doesn't seem to be a way of doing any parallel processing unless I use a loop that tackles each index choice individually (see the sketch below). That would speed up the process considerably. However, the real issue is the size of the data, and 50,000 rows is hardly 'big data'.
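For illustration, a minimal sketch of that per-index loop using parallel::mclapply (Unix-like systems only). The index list below is an assumed subset of what NbClust's index = "all" covers, and each worker still recomputes the distance matrix and dendrogram, so the saving is bounded:

    library(NbClust)
    library(parallel)

    # Subset of the indices that index = "all" would run; adjust as needed.
    indices <- c("kl", "ch", "hartigan", "cindex", "db", "silhouette",
                 "duda", "pseudot2", "beale", "ratkowsky", "ball",
                 "ptbiserial", "frey", "mcclain", "dunn", "sdindex", "sdbw")

    # One NbClust call per index, spread over the available cores.
    res <- mclapply(indices, function(idx) {
      tryCatch(
        NbClust(d, distance = "euclidean", min.nc = 2, max.nc = 30,
                method = "complete", index = idx)$Best.nc,
        error = function(e) NA
      )
    }, mc.cores = max(1, detectCores() - 1))
    names(res) <- indices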

It seems that using samples to speed up the process is a good idea, much as clara is far faster than pam.
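For comparison, this is roughly how that sampling idea looks with clara from the cluster package; the k, samples and sampsize values here are placeholders, not recommendations:

    library(cluster)

    # clara fits pam on several random sub-samples and keeps the best result,
    # which is why it scales where a full pam run is impractical.
    cl <- clara(d, k = 3, metric = "euclidean",
                samples = 50,      # number of sub-samples drawn
                sampsize = 1000)   # rows per sub-sample
    table(cl$clustering)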

My question is, if I were to take samples and run NbClust, what size should the samples be and how many samples are sufficient?

I'd like the answer as a proportion of n, the total number of rows, so that the answer scales.

A final note: before doing any of this work, I would run clustertend (or hopkins) to check the Hopkins statistic and see whether clustering even makes sense.

I have run:

library(factoextra)  # get_clust_tendency() lives here
clustend <- get_clust_tendency(scale(d), n = 5000)

But on large data sets this will also take a long time, and I'm not sure whether sampling as in my idea above is wise or whether you should really test the whole set to be sure. I'm also not sure whether 5000 was the right number to choose; I'd seen examples where the number chosen was roughly 1/3 of the size of the data set, but those sets were under 300 rows and the choice wasn't well explained.
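One way to probe whether a sample is representative here, without testing the whole set, is to compute the Hopkins statistic on several independent sub-samples and look at how much it varies. A minimal sketch, assuming factoextra's get_clust_tendency, with sub-sample sizes chosen purely for illustration:

    library(factoextra)

    # Hopkins statistic on 10 random sub-samples; if the values are tightly
    # grouped, one sample is probably telling the same story as the full set.
    hopkins_reps <- replicate(10, {
      sub <- scale(d)[sample(nrow(d), 2000), ]
      get_clust_tendency(sub, n = 200, graph = FALSE)$hopkins_stat
    })
    summary(hopkins_reps)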

Sample data:

    set.seed(123)
    d <- data.frame(x = unlist(lapply(1:500, function(i) rnorm(100, runif(1) * 2^2))),
                    y = unlist(lapply(1:500, function(i) rnorm(100, runif(1) * 2^2))),
                    z = unlist(lapply(1:500, function(i) rnorm(100, runif(1) * 2^2))))

UPDATE:

Here are the results of NbClust, which is why I don't want to use fewer tests, and which may also be evidence of why a sample could misrepresent the data:

* Among all indices:
* 6 proposed 2 as the best number of clusters
* 5 proposed 3 as the best number of clusters
* 2 proposed 5 as the best number of clusters
* 4 proposed 8 as the best number of clusters
* 2 proposed 10 as the best number of clusters
* 1 proposed 13 as the best number of clusters
* 1 proposed 23 as the best number of clusters
* 1 proposed 24 as the best number of clusters
* 1 proposed 27 as the best number of clusters

Best Answer

Many clustering algorithms depend on some distance measure between the points. With 50,000 points, the distance matrix for your data will be too big for many computing environments: there are about 1.25 billion unique pairwise distances, roughly 10 GB stored as doubles. It may be that the resultant matrix could fit into your 100 GB of RAM, but it does not surprise me that computing on that scale takes a long time.
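As a quick back-of-the-envelope check in R:

    # Unique pairwise distances for n = 50,000 points, stored as 8-byte
    # doubles, expressed in GiB: about 9.3 GiB for the lower triangle alone.
    n <- 50000
    n * (n - 1) / 2 * 8 / 2^30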

Of course, sampling is a natural response to that problem. I do not have a direct answer for you about sample size or number of samples, but I do have a suggestion about what you should consider in making those choices. I am afraid it is a rather unsatisfying suggestion: it depends on your data and what you want to do with it.

First imagine that your 50,000 data points consist of two natural clusters that are fairly well separated from each other and approximately equal in size. Each cluster contains about 25,000 points; suppose further that there is a small amount of noise in your data, so that there are 500 (1%) noise points that do not cluster well with anything. Your comment made the (reasonable) proposal of taking 10 samples of size 1000. This procedure should do a good job of revealing the underlying structure of data like this and help you make good decisions about how to handle the full data set.

However, suppose instead that your data has those same two well separated clusters of equal size, 400 noise points, and also another group of about 100 points that are tightly clustered and well separated from the two main clusters. Your samples of size 1000 would typically contain only about 2 points from this smaller cluster and 8 of the noise points. It would be very easy to interpret the points from the small cluster as part of the noise, completely missing the existence of the smaller cluster.

[Figure: sampled data]
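To make that concrete, here is a small simulation of the hypothetical counts in the scenario above (not your data): how many points from a 100-point sub-cluster does a random 1,000-row sample of 50,000 rows actually catch?

    # The first 100 of 50,000 rows play the role of the small, tight cluster.
    set.seed(1)
    n_total  <- 50000
    n_small  <- 100
    n_sample <- 1000

    seen <- replicate(1000, {
      idx <- sample(n_total, n_sample)   # one random sample of 1,000 rows
      sum(idx <= n_small)                # small-cluster rows it contains
    })

    mean(seen)       # on average ~2 small-cluster points per sample
    mean(seen >= 5)  # fraction of samples with enough of them to stand out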

Seeing these two possibilities, it seems important to ask, "Should you care?" That is why it matters what you want to do with the clustering. Suppose you were developing a target marketing campaign for a mass-market product with broad appeal. Ignoring the small cluster (0.2% of the market) might be just fine. But if you were trying to isolate the market for a product that only appeals to a small, select group, ignoring that group could be a disaster. Is a coarse description of your data sufficient for your purposes, or do you need to know about smaller sub-populations? Your sample size, together with the size of any sub-populations, will determine whether they are apparent in your samples.