Solved – Clustering a big dataset (12 million rows) with categorical and numerical columns

clustering, large-data, r, sample-size

I have 6 months of sales data (about 12 million unlabeled rows) that I need to cluster.
I am going to use 4 numerical variables and 1 categorical variable (2 levels). As you can imagine, the amount of data is really big, so I was wondering what I can do to speed up the whole procedure, and also whether k-prototypes is the best algorithm to use or whether there is a better algorithm that can handle such a big mixed-type dataset.
I know that working on a sample will be faster, but since my dataset is huge, how big should the sample be to be representative?
Also, as it is sales data, how can I be sure that I will get a representative sample?

Best Answer

First you need to be clear on what you need. Often clustering is not that interesting once you've understood what it actually does...

I'd assume that you first need to prepare the data, for example aggregate it in a more interesting way. That will likely give you more attributes, and far fewer instances.
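For example, aggregating transaction rows per customer might look like the sketch below. The column names (`customer_id`, `amount`, `quantity`, `channel`) and the toy data are assumptions, not from the question:

```python
import pandas as pd

# Hypothetical raw sales rows: one row per transaction line.
# Column names here are assumptions, not from the original question.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [10.0, 5.0, 20.0, 7.5, 2.5, 99.0],
    "quantity":    [1, 1, 2, 1, 1, 3],
    "channel":     ["web", "web", "store", "web", "store", "web"],
})

# Collapse millions of transaction rows to one row per customer:
# more attributes, far fewer instances.
per_customer = raw.groupby("customer_id").agg(
    n_orders=("amount", "size"),
    total_spent=("amount", "sum"),
    avg_basket=("amount", "mean"),
    total_items=("quantity", "sum"),
    web_share=("channel", lambda s: (s == "web").mean()),
)
print(per_customer)
```

A table like `per_customer` is usually a much more meaningful clustering input than raw transaction lines.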

Also, frequent itemsets and association rules are often much more interesting on sales data than clusters.

But your data has just one categorical variable... Split the data on it, then you can use k-means on the numerical columns, which (as long as you use a good algorithm and not Lloyd's) will easily scale to the entire dataset. But since it is k-means, a few thousand data points will be enough; larger data only yields diminishing returns.
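The split-then-cluster idea might look like the sketch below, on synthetic data. The tiny `kmeans` here is a plain Lloyd-style loop, which is fine on a small sample; for the full data you would reach for a library implementation with a faster algorithm, as the answer suggests:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: 4 numeric features plus one binary category column.
X = rng.normal(size=(2000, 4))
category = rng.integers(0, 2, size=2000)  # the 2-level variable

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means for a small sample (for the full 12M rows,
    use a library implementation with a faster algorithm)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its cluster.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Split on the categorical variable, then cluster each part on the
# numeric columns only -- no mixed-type distance (k-prototypes) needed.
for level in (0, 1):
    part = X[category == level]
    labels, centers = kmeans(part, k=3)
    print(level, np.bincount(labels))
```

Because each split contains only numerical columns, ordinary k-means applies, and the binary variable never has to enter the distance function.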

Anyway, never scale to a big dataset until you know that your approach works. It would be wasted time and resources to compute something on the big data only to find out that it is not what you wanted in the first place. First use a sample to understand the problem and the solution, then work on scaling the solution to the entire dataset.
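On the representativeness question: one common approach is a stratified sample, drawing the same fraction from each month-and-category stratum so the sample keeps the seasonal and categorical mix of the full data. A sketch with pandas, where the column names (`month`, `channel`, `amount`) and the synthetic table are assumptions standing in for the 12M-row dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 12M-row sales table; column names
# are assumptions, not from the question.
n = 100_000
df = pd.DataFrame({
    "month":   rng.integers(1, 7, size=n),               # 6 months of data
    "channel": rng.choice(["web", "store"], size=n, p=[0.7, 0.3]),
    "amount":  rng.exponential(50.0, size=n),
})

# Stratify on month and the categorical variable so the sample keeps
# the same seasonal/category mix as the full data.
sample = df.groupby(["month", "channel"]).sample(frac=0.05, random_state=0)

print(len(sample))  # roughly 5% of the rows
print(df["channel"].value_counts(normalize=True))
print(sample["channel"].value_counts(normalize=True))  # near-identical mix
```

A few thousand rows sampled this way is typically enough for k-means, per the diminishing-returns point above; you can verify stability by comparing cluster centers across a few independent samples.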
