R – Effective Methods for Clustering a Dense Dataset

Tags: clustering, r

Problem:

I am trying to figure out the best way to find clusters in a dataset whose observations are densely packed together. The dataset consists of retail stores described by three numeric variables based on operations metrics.

I do not know how to simulate a dataset for a reproducible example of this situation. The real data are densely clustered with some outliers, and there are under 4k observations.

Business objective:

We need to separate the dataset into groups based on several variables.

The goal is to narrow down the stores with the greatest priority. Later on, we will use inferential statistics to determine the causes behind the operations metrics. Segmenting the stores by priority using the three operations variables seems like a sensible approach.

I tried two different partitioning clustering methods, several values of k, and different variable subsets, but all yielded poor validation results. Here are the steps I took:

Clustering with two of the three variables:

  1. Standardized the variables and computed a Euclidean-distance dissimilarity matrix with the daisy() function from the cluster package on CRAN.

  2. Chose k for k-means by looking at the SSE (elbow) chart from the kmeans() function.

  3. Chose k for k-medoids with the pamk() function in the fpc package on CRAN, by the highest average silhouette width among clusters; this resulted in an average silhouette width of 0.23. K-medoids itself was run with the pam() function from the cluster package on CRAN.

  4. Chose the clustering algorithm by Dunn index, computed with the cluster.stats() function in fpc; the highest result was k-medoids with 0.002.

Clustering with all three variables:
- Same procedure as above (the whole pipeline is sketched in code below).
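
In code, the procedure looked roughly like this (stores stands in for my data frame of metrics; the k range is just what I scanned):

    library(cluster)  # daisy(), pam()
    library(fpc)      # pamk(), cluster.stats()

    # Step 1: standardized Euclidean dissimilarity matrix
    d <- daisy(stores, metric = "euclidean", stand = TRUE)

    # Step 2: SSE (elbow) chart for choosing k in k-means
    sse <- sapply(2:10, function(k)
      kmeans(scale(stores), centers = k, nstart = 25)$tot.withinss)
    plot(2:10, sse, type = "b", xlab = "k", ylab = "total within-cluster SSE")

    # Step 3: pamk() chooses k by average silhouette width; pam() fits k-medoids
    pk  <- pamk(d, krange = 2:10)
    fit <- pam(d, k = pk$nc)

    # Step 4: Dunn index for comparing the algorithms
    cluster.stats(d, fit$clustering)$dunn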

Result:
K-medoids with 2 clusters using two variables gave the highest Dunn index.

Overview:
After selecting the optimal number of clusters for each clustering method and comparing the best candidates by Dunn index, the resulting clusters overlap.

What is the recommended method for performing cluster analysis on densely clustered datasets? Do I need to perform clustering multiple times in order to segment the data further?

EDIT: Added a scatterplot showing the clustering with all 3 variables.

[Scatterplot with cluster labels color-coded]

Best Answer

As @Anony-Mousse implies, it isn't clear right now that your data actually are clusterable. In the end, you may choose to simply chop your data into partitions, if that will serve your business purposes, but there may not be any real latent groupings.

From where I sit, I cannot provide any guaranteed solutions, but perhaps I can offer some suggestions that will be profitable:

  1. You have a single clear outlier (e.g., visible in the upper right corner of the [2,3] scatterplot) that will likely distort any analysis you try. You may want to investigate that store separately. In the interim, I would set that point aside.
  2. It isn't clear how much data you have, but it looks like a lot. You state that you have "under 4k observations". If it is close to that amount, say >3k, then you have a lot. Since a good deal of exploratory data analysis will be necessary, I would randomly partition your data into two halves, explore the first half, and then validate your choices with the other half afterwards (see the code sketch after this list).
  3. I would experiment with various transformations of your variables to see if you can get better (i.e., more spherical) distributions. For example, taking the logarithm of your data may be appropriate. After finding a suitable transformation, check again for outliers.
  4. Then you will need to standardize each variable so that its mean is 0 and standard deviation is 1. Be sure to keep the original mean and SD for each variable so that you can apply exactly the same transformation later when you work with the second set.
  5. At this point (and only now), you can try clustering. I would not use k-means or k-medoids. Since you will have overlapping clusters, you will need a method that can handle that. The clustering algorithms I am familiar with that can do so are fuzzy k-means, Gaussian mixture modeling, and clustering by kernel density estimation (all three are sketched in code after this list).

    • Fuzzy k-means is discussed on CV here; you can also try this search. To perform fuzzy k-means in R, you can use ?fanny.
    • Threads about Gaussian mixture modeling can be found on the site under the corresponding tag. Finite mixture modeling can be done in R with the mclust package. I have demonstrated GMM with mclust on CV here and here.
    • Clustering by kernel density estimation is probably more esoteric. You can read the original paper [1] here. You can use kernel densities to cluster in R with the pdfCluster package.

    There is a continuity here: Fuzzy k-means essentially approximates GMM, but imposes sphericality on your clusters, which GMM does not do. GMMs make a very strong assumption that each cluster is multivariate normal (albeit possibly with different variances and covariances). If that isn't (nearly) perfectly true, the results can be distorted. Moreover, although kernel density estimates use a multivariate Gaussian kernel by default, the end result can be much more flexible and needn't yield multivariate normal clusters at all. This line of reasoning may suggest you simply go with the latter, but if the former constraints / assumptions hold they will benefit your analysis.

  6. You mention a variety of cluster validation metrics that you are using. Those are valuable, but I would select the method and the final clustering solution by which possibility makes sense of the data given your knowledge of the topic and whether it provides actionable business intelligence. You should also try to visualize the clusters in various ways.
  7. Check your chosen strategy by performing the exact same preprocessing and clustering on the other half of your data and see if you get similar and equally coherent / valuable results.
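
To make points 2 through 4 concrete, here is a minimal sketch in R. Everything in it is illustrative: I am assuming your data sit in a data frame called stores with three strictly positive numeric columns, and that a log transform turns out to be the one you settle on.

    set.seed(1)  # for a reproducible split

    # Point 2: randomly split into an exploration half and a validation half
    idx     <- sample(nrow(stores), size = floor(nrow(stores) / 2))
    explore <- stores[idx, ]
    holdout <- stores[-idx, ]

    # Point 3: try a transformation (log assumes strictly positive values);
    # re-check for outliers afterwards
    explore_t <- log(explore)

    # Point 4: standardize, saving the means and SDs so the holdout half
    # can later be transformed with exactly the same parameters
    mu  <- colMeans(explore_t)
    sdv <- apply(explore_t, 2, sd)
    explore_z <- scale(explore_t, center = mu, scale = sdv)
    holdout_z <- scale(log(holdout), center = mu, scale = sdv)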
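
And a sketch of point 5, continuing from the code above (k = 3 for fanny() is purely illustrative; Mclust() and pdfCluster() choose the number of clusters themselves):

    library(cluster)     # fanny(): fuzzy k-means
    library(mclust)      # Mclust(): Gaussian mixture modeling
    library(pdfCluster)  # pdfCluster(): clustering by kernel density estimation

    ff <- fanny(explore_z, k = 3)  # soft memberships in ff$membership
    gm <- Mclust(explore_z)        # BIC selects the number of components and covariance model
    pc <- pdfCluster(explore_z)    # clusters correspond to modes of the estimated density

    # Compare the hard assignments of the first two methods
    table(ff$clustering, gm$classification)

For point 7, you would rerun whichever method you settle on, on holdout_z, and check that you recover similar clusters there.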

1. Azzalini, A., & Torelli, N. (2007). Clustering via nonparametric density estimation. Statistics and Computing, 17(1), 71-80.
