What is the usefulness of k-means clustering in high-dimensional spaces, and why can it be better (or worse) than other clustering methods when dealing with such spaces?
K-Means Clustering – Usefulness on High Dimensional Data
clustering, k-means
Related Solutions
For k-means specifically, you can use the Gap statistic. Basically, the idea is to compute a goodness-of-clustering measure based on the average dispersion compared to a reference distribution, for an increasing number of clusters. More information can be found in the original paper:
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the numbers of clusters in a data set via the gap statistic. J. R. Statist. Soc. B, 63(2): 411-423.
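As an illustration (a minimal sketch of my own, not from the paper), the Gap statistic is implemented in the cluster R package as clusGap():

library(cluster)
# Gap statistic for k = 1..8 on the iris measurements, using k-means as the
# clustering function and 50 reference data sets
gap <- clusGap(iris[, 1:4], FUNcluster = kmeans, K.max = 8, B = 50, nstart = 20)
plot(gap)   # gap curve as a function of the number of clusters
# suggested number of clusters
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])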
The answer that I provided to a related question highlights other general validity indices that might be used to check whether a given dataset exhibits some kind of a structure.
When you don't have any idea of what you would expect to find if there were noise only, a good approach is to use resampling and study cluster stability. In other words, resample your data (via bootstrap or by adding small noise to it) and compute the "closeness" of the resulting partitions, as measured by Jaccard similarities. In short, this lets you estimate the frequency with which similar clusters were recovered in the data. This method is readily available in the fpc R package as clusterboot(). It takes as input either raw data or a distance matrix, and allows you to apply a wide range of clustering methods (hierarchical, k-means, fuzzy methods). The method is discussed in the references below:
Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.
Hennig, C. (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis, 99, 1154-1176.
Below is a small demonstration with the k-means algorithm.
# simulate three fairly well-separated clusters in two dimensions
sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean[1], sd[1]),
                                      rnorm(n, mean[2], sd[2]))
xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)),
            sim.xy(100, c(2.5,0), c(.4,.2)),
            sim.xy(100, c(1.25,.5), c(.3,.2)))
library(fpc)
# assess the stability of a 3-cluster k-means solution over 20 bootstrap samples
km.boot <- clusterboot(xy, B=20, bootmethod="boot",
                       clustermethod=kmeansCBI,
                       krange=3, seed=15555)
The results are quite positive in this artificial (and well-structured) dataset, since none of the three clusters (krange) were dissolved across the bootstrap samples, and the average clusterwise Jaccard similarity is > 0.95 for all clusters.
Plotting the results of the 20 bootstrap samples shows that the statistical units tend to stay grouped in the same cluster, with few exceptions for those observations lying in between.
You can extend this idea to any validity index, of course: draw a new series of observations by bootstrap (with replacement), compute your statistic (e.g., silhouette width, cophenetic correlation, Hubert's gamma, within-cluster sum of squares) for a range of cluster numbers (e.g., 2 to 10), repeat 100 or 500 times, and look at the boxplot of your statistic as a function of the number of clusters.
Here is what I get with the same simulated dataset, but using Ward's hierarchical clustering and considering the cophenetic correlation (which assesses how well the distance information is reproduced in the resulting partitions) and the silhouette width (a combined measure assessing intra-cluster homogeneity and inter-cluster separation).
The cophenetic correlation ranges from 0.6267 to 0.7511 with a median value of 0.7031 (500 bootstrap samples). Silhouette width appears to be maximal when we consider 3 clusters (median 0.8408, range 0.7371-0.8769).
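A rough sketch of this resampling scheme (my own code, reusing the xy data simulated above and assuming the cluster package for the silhouette) could look like this:

library(cluster)
boot.validity <- function(x, k.range = 2:10, B = 100) {
  sil  <- matrix(NA, B, length(k.range), dimnames = list(NULL, k.range))
  coph <- numeric(B)
  for (b in 1:B) {
    xb <- x[sample(nrow(x), replace = TRUE), ]   # bootstrap sample
    d  <- dist(xb)
    hc <- hclust(d, method = "ward.D2")          # Ward's hierarchical clustering
    coph[b] <- cor(d, cophenetic(hc))            # cophenetic correlation
    for (j in seq_along(k.range))                # average silhouette width for each k
      sil[b, j] <- mean(silhouette(cutree(hc, k.range[j]), d)[, "sil_width"])
  }
  list(cophenetic = coph, silhouette = sil)
}
res <- boot.validity(xy, B = 100)
boxplot(res$silhouette, xlab = "Number of clusters",
        ylab = "Average silhouette width")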
First some background:
R is a good choice and has many clustering methods spread across different packages. These include hierarchical clustering, partitioning clustering, model-based clustering, and cluster-wise regression.
Connectivity-based clustering, or hierarchical clustering (also called hierarchical cluster analysis or HCA), is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. In this method, at different distances, different clusters will form, which can be represented using a dendrogram.
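For illustration (a short example of my own, not from the original text), agglomerative clustering in R:

d  <- dist(iris[, 1:4])              # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")  # agglomerative clustering, Ward linkage
plot(hc, labels = FALSE)             # dendrogram
groups <- cutree(hc, k = 3)          # cut the tree into 3 clusters
table(groups, iris$Species)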
Centroid-based clustering: In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances to the cluster centers are minimized.
km <- kmeans(iris[,1:4], 3)                         # k-means with k = 3
plot(iris[,1], iris[,2], col=km$cluster)            # colour points by cluster
points(km$centers[,c(1,2)], col=1:3, pch=8, cex=2)  # add the cluster centers
table(km$cluster, iris$Species)                     # compare with the species labels
Distribution-based clustering: the clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects most likely belonging to the same distribution. The main problem of this approach is overfitting.
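A minimal sketch (not from the original answer) using Gaussian mixture models via the mclust package:

library(mclust)
fit <- Mclust(iris[, 1:4])   # number of components and covariance model chosen by BIC
summary(fit)
table(fit$classification, iris$Species)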
Density-based clustering: in density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise or border points.
In density-based clustering, a cluster extends along the density distribution. Two parameters are important: "eps" defines the radius of the neighborhood of each point, and "MinPts" is the number of neighbors required within that radius. The basic algorithm, called DBSCAN, proceeds as follows:
- First scan: for each point, compute the distance to all other points and increment a neighbor count for every distance smaller than "eps".
- Second scan: for each point, mark it as a core point if its neighbor count is greater than "MinPts".
- Third scan: for each core point, if it is not already assigned to a cluster, create a new cluster and assign it to this core point as well as to all of its neighbors within the "eps" radius.
Unlike other clustering methods, density-based clustering can leave some points as outliers (data points that don't belong to any cluster). On the other hand, it can detect clusters of arbitrary shape (they don't have to be circular at all).
library(fpc)
# sampleiris is assumed to be a random subsample of iris
# (it is not defined in the original snippet)
set.seed(1234)
sampleiris <- iris[sample(nrow(iris), 40), ]
# eps is the radius of the neighborhood, MinPts the number of neighbors
# required within eps
cluster <- dbscan(sampleiris[,-5], eps=0.6, MinPts=4)
plot(cluster, sampleiris)
plot(cluster, sampleiris[,c(1,4)])
# Notice points in cluster 0 are unassigned outliers
table(cluster$cluster, sampleiris$Species)
With the recent need to process larger and larger data sets (also known as big data), the willingness to trade the semantic meaning of the generated clusters for performance has been increasing. This led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently; the resulting "clusters", however, are merely a rough pre-partitioning of the data set, whose partitions are then analyzed with existing, slower methods such as k-means clustering.
For high-dimensional data, many of the existing methods fail due to the curse of dimensionality, which renders particular distance functions problematic in high-dimensional spaces. This led to new clustering algorithms for high-dimensional data that focus on subspace clustering (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and correlation clustering, which also looks for arbitrarily rotated ("correlated") subspace clusters that can be modeled by the correlation of their attributes. Examples of such algorithms are PROCLUS, P3C, and STATPC.
To your question:
The orclus package is available to perform subspace clustering and classification. The following example is taken from its manual:
# definition of a function for parameterized data simulation
sim.orclus <- function(k = 3, nk = 100, d = 10, l = 4, sd.cl = 0.05, sd.rest = 1, locshift = 1){
  ### input parameters for data generation
  # k        number of clusters
  # nk       observations per cluster
  # d        original dimension of the data
  # l        subspace dimension where the clusters are concentrated
  # sd.cl    (univariate) standard deviations for data generation
  #          (within cluster subspace concentration)
  # sd.rest  standard deviations in the remaining space
  # locshift parameter of a uniform distribution to sample different cluster means
  x <- NULL
  for(i in 1:k){
    # cluster centers
    apts <- locshift*matrix(runif(l*k), ncol = l)
    # sample points in original space
    xi.original <- cbind(matrix(rnorm(nk * l, sd = sd.cl), ncol = l)
                         + matrix(rep(apts[i,], nk), ncol = l, byrow = TRUE),
                         matrix(rnorm(nk * (d-l), sd = sd.rest), ncol = (d-l)))
    # subspace generation
    sym.mat <- matrix(nrow = d, ncol = d)
    for(m in 1:d){
      for(n in 1:m){
        sym.mat[m,n] <- sym.mat[n,m] <- runif(1)
      }
    }
    subspace <- eigen(sym.mat)$vectors
    # transformation
    xi.transformed <- xi.original %*% subspace
    x <- rbind(x, xi.transformed)
  }
  clids <- rep(1:k, each = nk)
  result <- list(x = x, cluster = clids)
  return(result)
}
# simulate data of 2 classes where class 1 consists of 2 subclasses
simdata <- sim.orclus(k = 3, nk = 200, d = 15, l = 4, sd.cl = 0.05, sd.rest = 1, locshift = 1)
x <- simdata$x
y <- c(rep(1,400), rep(2,200))
res <- orclass(x, y, k = 3, l = 4, k0 = 15, a = 0.75)
res
# compare results
table(res$predict.train$class, y)
You may also be interested in HDclassif (An R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data).
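As a small sketch of my own (assuming the hddc() function of that package with default settings), it can be applied to the simulated subspace data generated above:

library(HDclassif)
fit <- hddc(simdata$x, K = 3)         # subspace Gaussian mixture with 3 clusters
table(fit$class, simdata$cluster)     # compare with the simulated labels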
Best Answer
Is k-means meaningful at all?
See for example my answer here: https://stats.stackexchange.com/a/35760/7828
k-means optimizes variances. Is the unweighted sum of variances meaningful on your data set? Probably not. How, then, can k-means be meaningful? In high-dimensional data, distance doesn't work well. But variance is just squared Euclidean distance; so is it meaningful to optimize something that you know doesn't work in high-dimensional data?
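For reference (not part of the original answer), the objective k-means minimizes is the within-cluster sum of squares,
$$\mathrm{WCSS} = \sum_{j=1}^{k}\ \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$
i.e., up to the cluster sizes, a sum of per-cluster variances measured with squared Euclidean distance.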
For the particular problems of high-dimensional data, I recommend the following study:
Its main focus is outlier detection, but its observations on the challenges of high-dimensional data apply to a much broader context. It presents some simple experiments showing how high-dimensional data can be a problem. What I like about this study is that it also shows that high-dimensional data can be easy, too; it's not black and white, but you need to carefully study your data.
Usefulness is a different matter. Often people use k-means not to actually discover clusters, but to find representative objects: it's a clever way of semi-randomly sampling k objects that aren't too similar to one another.
If you only need a clever way of sampling, k-means may be very useful.
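A small sketch of my own (not from the answer) of using k-means purely to pick k representative objects: cluster, then keep the observation nearest to each center.

X  <- scale(iris[, 1:4])
km <- kmeans(X, centers = 10, nstart = 25)
# index of the observation closest to each of the 10 centers
nearest <- sapply(seq_len(nrow(km$centers)), function(j) {
  idx <- which(km$cluster == j)
  d2  <- colSums((t(X[idx, , drop = FALSE]) - km$centers[j, ])^2)
  idx[which.min(d2)]
})
iris[nearest, ]   # 10 representative flowers, one per cluster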