See, even hierarchical clustering needs parameters if you want to get a partitioning out. In fact, hierarchical clustering has (roughly) four parameters: 1. the actual algorithm (divisive vs. agglomerative), 2. the distance function, 3. the linkage criterion (single-link, ward, etc.) and 4. the distance threshold at which you cut the tree (or any other extraction method).
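To make those four choices concrete, here is a minimal sketch in base R (the data frame X, the Euclidean/Ward choices, and the cut height of 5 are all placeholder assumptions, not recommendations):

```r
# X is a hypothetical numeric data frame.
d  <- dist(X, method = "euclidean")    # 2. the distance function
hc <- hclust(d, method = "ward.D2")    # 1. agglomerative (divisive would be cluster::diana); 3. Ward linkage
plot(hc)                               # inspect the dendrogram before deciding where to cut
labels <- cutree(hc, h = 5)            # 4. cut the tree at a distance threshold of 5
# ...or extract a fixed number of clusters instead of cutting at a height:
# labels <- cutree(hc, k = 4)
```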
The fact is that there is no good "push button" solution to cluster analysis. It is an exploratory technique: you have to try different methods and parameters and analyze the results.
I have found DBSCAN to be very usable in most cases. Yes, it has two parameters (the distance threshold, a.k.a. the neighborhood predicate, and minPts, a.k.a. the core-point predicate) - I'm not counting the distance function separately this time, because what is really needed is an "is neighbor of" binary predicate; see GDBSCAN.
The reason is that in many applications you can choose these values intuitively if you have understood your data well enough. E.g. when working with geo data, distance is literally in kilometers, which lets me specify the spatial resolution intuitively.
Similarly, minPts gives me intuitive control over how "significant" a subset of observations needs to be before it becomes a cluster.
Usually, when you find DBSCAN hard to use, it is because you have not yet understood what "distance" means for your data. You first need to figure out how to measure distance and what the resulting numbers mean to you. Then you'll know which threshold to use.
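As an illustration only, here is a minimal sketch with the dbscan package in R; the data matrix X and the eps/minPts values are placeholder assumptions you would set from your own understanding of the data:

```r
library(dbscan)

# X is a hypothetical numeric matrix (e.g., coordinates projected to km).
# eps is the neighborhood radius (the "is neighbor of" threshold);
# minPts is how many neighbors a point needs to count as a core point.
res <- dbscan(X, eps = 0.5, minPts = 10)
table(res$cluster)          # cluster 0 collects the points labeled as noise
# kNNdistplot(X, k = 10) can help pick eps: look for the "knee" in the curve
```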
And in the end, go and try things out. This is data exploration, not "return(truth);". There is no "true" clustering. There are only "obvious", "useless" and "interesting" clusterings, and these qualities cannot be measured mathematically; they are subjective to the user.
As @Anony-Mousse implies, it isn't clear right now that your data actually are clusterable. In the end, you may choose to simply chop your data into partitions, if that will serve your business purposes, but there may not be any real latent groupings.
From where I sit, I cannot provide any guaranteed solutions, but perhaps I can offer some suggestions that will be profitable:
- You have a single clear outlier (e.g., in the upper right corner of the [2,3] scatterplot) that will likely distort any analysis you try. You may want to investigate that store separately. In the interim, I would set that point aside.
- It isn't clear exactly how much data you have; you state that you have "under 4k observations". If it is close to that amount, say >3k, that is plenty. Since a good deal of exploratory data analysis will be necessary, I would randomly partition your data into two halves, explore the first half, and then validate your choices on the other half afterwards.
- I would experiment with various transformations of your variables to see if you can get better (i.e., more spherical) distributions. For example, taking the logarithm of your data may be appropriate. After finding a suitable transformation, check again for outliers.
- Then you will need to standardize each variable so that its mean is 0 and standard deviation is 1. Be sure to keep the original mean and SD for each variable so that you can apply exactly the same transformation later when you work with the second half (a minimal sketch of this preprocessing follows this list).
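Here is a minimal sketch of this preprocessing in R, assuming a numeric data frame dat with strictly positive values (so the log is defined); all names and choices here are hypothetical:

```r
set.seed(1)  # make the random split reproducible

# Randomly partition into an exploration half and a validation half.
idx      <- sample(nrow(dat), size = floor(nrow(dat) / 2))
explore  <- dat[idx, ]
validate <- dat[-idx, ]

# Transform (log is just one candidate), then standardize the exploration half,
# keeping each variable's mean and SD so the identical transformation can be
# applied to the validation half later on.
explore_log <- log(explore)
mu  <- sapply(explore_log, mean)
sdv <- sapply(explore_log, sd)
explore_std  <- scale(explore_log, center = mu, scale = sdv)
validate_std <- scale(log(validate), center = mu, scale = sdv)
```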
At this point (and only now), you can try clustering. I would not use k-means or k-medoids. Since you will have overlapping clusters, you will need a method that can handle that. The clustering algorithms I am familiar with that can do so are fuzzy k-means, Gaussian mixture modeling, and clustering by kernel density estimation.
- Fuzzy k-means is discussed on CV here; you can also try this search. To perform fuzzy k-means in R, you can use ?fanny from the cluster package (a combined sketch of all three methods follows this list).
- Threads about Gaussian mixture modeling can be found on the site under the gaussian-mixture tag. Finite mixture modeling can be done in R with the mclust package; I have demonstrated GMM with mclust on CV here and here.
- Clustering by kernel density estimation is probably more esoteric. You can read the original paper1 here. You can use kernel densities to cluster in R with the pdfCluster package.
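A minimal combined sketch of the three approaches in R, using the standardized exploration half from the earlier sketch (explore_std); the choice of 3 clusters for fanny is a hypothetical placeholder:

```r
library(cluster)     # fanny: fuzzy analysis (fuzzy k-means) clustering
library(mclust)      # Mclust: Gaussian mixture modeling
library(pdfCluster)  # pdfCluster: clustering via kernel density estimation

# Fuzzy k-means; k must be chosen by the analyst (3 is a placeholder).
fz <- fanny(explore_std, k = 3)
head(fz$membership)          # soft memberships, one column per cluster

# Gaussian mixture model; by default Mclust chooses the number of
# components (and the covariance structure) by BIC.
gmm <- Mclust(explore_std)
summary(gmm)

# Kernel-density clustering; the number of groups is inferred from the
# modes of the estimated density rather than specified in advance.
kd <- pdfCluster(explore_std)
kd                           # printing the object reports the groups found
```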
There is a continuity here: fuzzy k-means essentially approximates GMM, but it imposes sphericity on your clusters, which GMM does not. GMMs, in turn, make the very strong assumption that each cluster is multivariate normal (albeit possibly with different variances and covariances); if that isn't (nearly) perfectly true, the results can be distorted. Finally, although kernel density estimates use a multivariate Gaussian kernel by default, the end result can be much more flexible and needn't yield multivariate normal clusters at all. This line of reasoning may suggest you simply go with the latter, but if the former constraints/assumptions do hold, they will benefit your analysis.
- You mention a variety of cluster validation metrics that you are using. Those are valuable, but I would select the method and the final clustering solution based on which possibility makes sense of the data, given your knowledge of the topic, and whether it provides actionable business intelligence. You should also try to visualize the clusters in various ways.
- Check your chosen strategy by performing the exact same preprocessing and clustering on the other half of your data and see if you get similar and equally coherent / valuable results.
1. Azzalini, A. & Torelli, N. (2007). Clustering via nonparametric density estimation, Statistics and Computing, 17(1), pp. 71-80.
Best Answer
Sounds like you need HAC (hierarchical agglomerative clustering). There are many variants, but the basic idea is that you start with singleton clusters and progressively merge, based on different ways of determining which clusters are the "closest".
For more on HAC, see the wikipedia entry.