Problem:
I am trying to find the best way to identify clusters in a dataset whose observations are densely packed together. The dataset consists of retail stores with three numeric variables based on operations metrics.
I do not know how to simulate a dataset for a reproducible example like this; the real data are densely clustered with some outliers, and there are under 4,000 observations.
Business objective:
We need to separate the dataset into groups based on several variables.
The goal is to narrow down the stores with greater priority. Later on, we will use inferential statistics to determine the causes behind the operations metrics, so segmenting the stores by priority using the three operations variables makes sense.
I tried two different partitioning clustering methods, several k values, and different variable combinations, but all yielded poor validation results. Here are the steps I took:
Clustering with two of the three variables:
- Standardized the variables and built a dissimilarity matrix with Euclidean distance using the daisy() function from the cluster package on CRAN.
- Chose k for k-means by looking at the SSE (elbow) chart; clustering was done with the kmeans() function.
- Chose k for k-medoids with the pamk() function from the fpc package on CRAN, which selects the k with the highest average silhouette width among clusters – this resulted in an average silhouette width of 0.23. K-medoids itself was run with the pam() function from the cluster package.
- Chose between the clustering algorithms by Dunn index – the highest result was k-medoids at 0.002, computed with the cluster.stats() function in fpc. (A code sketch of this workflow follows the list.)
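Here is a minimal sketch of that workflow in R, assuming a data frame named stores whose columns metric1 and metric2 are two of the operations metrics (both names are placeholders):

```r
library(cluster)  # daisy(), pam()
library(fpc)      # pamk(), cluster.stats()

X <- stores[, c("metric1", "metric2")]

# Standardize and build a Euclidean dissimilarity matrix
d <- daisy(X, metric = "euclidean", stand = TRUE)

# k-means elbow chart: total within-cluster SSE over a range of k
sse <- sapply(2:10, function(k)
  kmeans(scale(X), centers = k, nstart = 25)$tot.withinss)
plot(2:10, sse, type = "b", xlab = "k", ylab = "Total within-cluster SSE")

# k-medoids: pamk() picks k by average silhouette width
pk  <- pamk(d, krange = 2:10)
fit <- pam(d, k = pk$nc)

# Dunn index for comparing the clusterings
cluster.stats(d, fit$clustering)$dunn
```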
Clustering with all three variables:
- Same procedure as above.
Result:
K-medoids with 2 clusters using two variables was the algorithm with the highest Dunn index.
Overview:
After selecting the optimal number of clusters for each clustering method and comparing the best one by Dunn index, the resulting clusters overlap.
What is the recommended method for performing cluster analysis on densely clustered datasets? Do I need to perform clustering multiple times in order to segment the data further?
EDIT: Added a scatterplot showing the clustering with 3 variables.
Best Answer
As @Anony-Mousse implies, it isn't clear right now that your data actually are clusterable. In the end, you may choose to simply chop your data into partitions, if that will serve your business purposes, but there may not be any real latent groupings.
From where I sit, I cannot provide any guaranteed solutions, but perhaps I can offer some suggestions that will be profitable:
At this point (and only now), you can try clustering. I would not use k-means or k-medoids. Since you will have overlapping clusters, you will need a method that can handle that. The clustering algorithms I am familiar with that can do so are fuzzy k-means, Gaussian mixture modeling, and clustering by kernel density estimation.
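Fuzzy clustering is available in the cluster package you are already using via fanny() (fuzzy analysis clustering, a close relative of fuzzy k-means). A minimal sketch, again assuming a data frame stores and treating k = 2 as a placeholder:

```r
library(cluster)

# fanny() returns a membership weight for each store in each cluster
# rather than a hard assignment, so overlapping groups are permitted.
ff <- fanny(stores, k = 2, metric = "euclidean", stand = TRUE)

head(ff$membership)   # membership weights (each row sums to 1)
table(ff$clustering)  # nearest hard assignment, if one is needed
ff$coeff              # Dunn's partition coefficient (degree of fuzziness)
```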
Gaussian mixture models can be fit in R with the mclust package, which has been discussed on CV here and here. There is a continuity here: fuzzy k-means essentially approximates a GMM, but imposes sphericality on your clusters, which GMM does not. GMMs make a very strong assumption that each cluster is multivariate normal (albeit possibly with different variances and covariances). If that isn't (nearly) perfectly true, the results can be distorted. Moreover, although kernel density estimates use a multivariate Gaussian kernel by default, the end result can be much more flexible and needn't yield multivariate normal clusters at all. This line of reasoning may suggest you simply go with the latter, but if the former constraints / assumptions hold, they will benefit your analysis.
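For the GMM route, a minimal sketch with mclust; Mclust() selects both the covariance structure and the number of components by BIC (the data frame stores and the range G = 1:9 are placeholders). For the kernel-density route, the pdfCluster package implements the Azzalini & Torelli approach cited below.

```r
library(mclust)

# Fit Gaussian mixture models over a range of component counts;
# Mclust() chooses the best covariance shape and G by BIC.
gmm <- Mclust(stores, G = 1:9)

summary(gmm)               # chosen model, number of components, BIC
head(gmm$z)                # soft membership probabilities per store
table(gmm$classification)  # hard assignments, if needed
plot(gmm, what = "classification")
```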
1. Azzalini, A. & Torelli, N. (2007). Clustering via nonparametric density estimation. Statistics and Computing, 17(1), 71-80.