Solved – User segmentation by clustering with sparse data

clusteringk-meanssparse

Imagine that I have 100k users and 1k categories. For each user, up to 5 categories, I know how much money they have spent. Obviously my data is very sparse.

Now I want to group users by the money they spend on different categories. This way, I could group together users who are 'cheap' in some certain categories and 'snobby' in some other categories.

After standardizing the values by calculating the number of times of standard deviation they deviate from the category means, I have tried k-means clustering but I ended up one cluster getting bigger and bigger while others shrink to clusters that contain only few users as the number of iterations k-means do increases.

How can I tackle clustering with sparse data problem? Any pointers, suggestions or ideas are appreciated.

Best Answer

$K$-Means is very unlikely to give meaningful clusters on such high dimensional space (see e.g. Curse of Dimensionality).

I agree with the suggestions in the comments: you need to reduce the dimensionality of your data and then do $K$-Means on the reduced space.

However I would not do PCA in the proper way: for PCA you need to do mean normalization, and that will turn a sparse matrix into a dense one. What you can do instead is SVD - without mean normalization - and then apply the clustering algorithm. Also note that Randomized SVD should work fine, but way faster.

Another potentially interesting technique that you can apply in Non-Negative Matrix Factorization. Since your data contains only positive values (if I got it correctly), NMF should suite well for the problem. Also, you can interpret the results of NMF as clustering: when we are doing $n$-dimensional NMF, we can think of the columns of the resulting matrix as clusters, with the value in the cell $i$ being the degree of association of the observation to the cluster $i$.

You can read more about applying NMF for clustering in "Document clustering based on non-negative matrix factorization." by Xu, Wei, Xin Liu, and Yihong Gong (pdf).

Best Answer

Related Solutions

Solved – Clustering algorithms for extremely sparse data

Solved – Dirichlet Processes for clustering: how to deal with labels

Related Question