Solved – K-means assigns 96% of the data to one cluster. Any suggestions to improve the results?

clustering, k-means, normalization, unsupervised learning

Problem: K-means clustering assigns 96% of my data to one cluster. How can I improve my results, or should I conclude that no cluster structure exists in my dataset? DBSCAN also finds only one cluster.

Dataset: I have monthly purchase data for 15 products across all customers for the past 15 months. Most of my customers use a combination of these 15 products. However, the proportion of the different products purchased varies across customers and may occasionally vary from month to month. Usually the mix of products a customer purchases is similar month over month. For example, I have a few customers who buy mostly P4 and P5 and very little of the others, and a few customers who buy mainly P10 and comparatively little of the other products. Most of my customers will buy 10-12 products. My customers usually make daily transactions, and the data I have is aggregated monthly. There is huge variance across the purchase data: the (min, max) for P10 is (0, 2.685250e+07) while the (min, max) for P3 is (0, 5843). These ranges are taken across all customers and all months.

Methodology: I applied min-max scaling to normalize my dataset. However, when I plot the scatter plot, I don't see a normal curve (is that because this is count data rather than continuous data?). After normalization, I ran PCA, which reduced my feature set from 15 to 12.
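A note on the scaling step: min-max scaling is a linear rescaling, so it cannot change the shape of a column's distribution; a skewed column stays skewed. The sketch below is only an illustration of that effect in plain Python. The function name `min_max_scale` and the sample values are assumptions; only the maximum of 2.685250e+07 for P10 is taken from the question.

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical monthly amounts for one product. Only the largest value
# (2.685250e+07, the reported P10 maximum) comes from the question;
# the remaining values are invented for illustration.
p10 = [0, 1_200, 3_500, 8_000, 15_000, 40_000, 26_852_500]

scaled = min_max_scale(p10)

# Count how many scaled values end up squeezed below 0.01
near_zero = sum(1 for s in scaled if s < 0.01)

print(scaled)
print(near_zero, "of", len(scaled), "values fall below 0.01")
```

With a single very large value in the column, almost every other value is compressed toward 0, so distances between typical customers become tiny compared with the distance to the extreme ones.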

I plotted the elbow curve. Below are two plots. Elbow curve plotted on scaled data:

[Figure: elbow curve on scaled data]

Elbow curve on PCA reduced data:

[Figure: elbow curve on PCA-reduced data]

I then ran K-means on the PCA-reduced data and formed two clusters. Cluster 0 has 3397 entries while cluster 1 has only 128! Below is a scatter plot of my clusters:

[Figure: scatter plot of the two K-means clusters]

How can I make better clusters? What algorithm should I use? I want to understand which product combinations different customers are buying. The current result doesn't identify the customers who mostly buy P10 and little else, or other groups of similar customers.

Best Answer

What if the data does contain only one large cluster with 95% of the data in it? Maybe most of your customers behave similarly in the available data?

Your visualization only shows one big cluster. So k-means does what it is supposed to do.
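To make that concrete, here is a minimal sketch of Lloyd's algorithm (the usual k-means iteration) on one-dimensional toy data. Everything in it is an assumption for illustration: the helper `kmeans_1d`, the 96 tightly packed points, the 4 extreme points, and the choice of the minimum and maximum as starting centroids. It is not the asker's data or pipeline.

```python
def kmeans_1d(points, centroids, n_iter=20):
    """Plain Lloyd's algorithm on 1-D data, starting from the given centroids."""
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[k]
            )
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Invented data: a dense mass near zero plus a handful of extreme values,
# similar in spirit to min-max scaled data with a heavy tail.
bulk = [i / 1000 for i in range(96)]   # 0.000 ... 0.095
tail = [0.6, 0.75, 0.9, 1.0]
points = bulk + tail

labels, centroids = kmeans_1d(points, [min(points), max(points)])
sizes = [labels.count(0), labels.count(1)]
print(sizes, centroids)
```

Because k-means minimizes the within-cluster sum of squared distances, spending one centroid on the few far-away points reduces the objective far more than splitting the dense mass would. A lopsided split is therefore the expected outcome on data shaped like this.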

Your best bet is to pre-process the data differently and to consider other algorithms. Since your data is not continuous and is quite sparse, I would use association rule mining rather than clustering to identify buying patterns.
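As a rough sketch of what association rule mining looks like, the plain-Python example below computes support and confidence for product pairs. It assumes each customer-month has already been turned into a "basket" of products bought in meaningful quantity; how to choose that threshold is left open. The baskets, the function name `pair_rules`, and the `min_support` value are all hypothetical, and a real analysis would more likely use a dedicated Apriori or FP-growth implementation that also handles larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: one set of products per customer-month
baskets = [
    {"P4", "P5"},
    {"P4", "P5", "P1"},
    {"P10"},
    {"P10", "P2"},
    {"P4", "P5"},
    {"P1", "P2", "P3"},
]

def pair_rules(baskets, min_support=0.3):
    """Return (lhs, rhs, support, confidence) for frequent product pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both products
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: of the baskets containing lhs, the share also containing rhs
            confidence = count / item_counts[lhs]
            rules.append((lhs, rhs, support, confidence))
    return rules

rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Rules of this kind describe which products tend to be bought together, which is close to the stated goal of finding product combinations, without requiring the customers to form well-separated clusters.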