Solved – Why only the mean value is used in the K-means clustering method

clustering, gaussian mixture distribution, k-means, normal distribution, unsupervised learning

In clustering methods such as K-means, Euclidean distance is the metric used. As a result, we only calculate the mean value within each cluster, and points are then reassigned based on their distance to each of these means.

I was wondering why the Gaussian function is not used as the metric. Instead of using x_i - mean(X), we could use exp(-(x_i - mean(X))^2 / std(X)^2). That way, not only would the separation between clusters be measured (via the mean), but the spread within each cluster would also be taken into account (via the std). Is this equivalent to the Gaussian mixture model?
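To make the idea concrete, here is a small NumPy sketch of the weighting I have in mind (the cluster data and the query point are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=100)  # one hypothetical cluster
xi = 7.0                                      # a point to score

# Plain k-means style: squared distance to the cluster mean
d2 = (xi - X.mean()) ** 2

# Proposed alternative: Gaussian similarity, which also uses the spread
sim = np.exp(-(xi - X.mean()) ** 2 / X.std() ** 2)

print(d2, sim)  # small distance corresponds to similarity close to 1
```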

It is beyond my question here, but I think mean-shift may raise the same question.

Best Answer

There are literally thousands of k-means variations, including variants with soft assignment and with variance and covariance estimation (usually referred to as Gaussian Mixture Modeling, fitted via the EM algorithm).
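For instance, scikit-learn exposes exactly this soft, variance-aware variant next to plain k-means; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two made-up clusters with very different spreads
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               rng.normal(4, 2.0, (100, 2))])

hard = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # hard labels, means only
gmm = GaussianMixture(n_components=2).fit(X)
soft = gmm.predict_proba(X)  # soft assignment using means AND covariances
```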

However, I'd like to point out a few things:

  • K-means is not based on Euclidean distance; it is based on variance minimization. Since the within-cluster variance is the sum of squared Euclidean distances to the mean, the minimum-variance assignment is also the one with the smallest squared Euclidean distance, and because the square root is monotone, the smallest Euclidean distance as well. For efficiency, it is actually smarter not to compute Euclidean distances at all and to work with the squares instead (see the first sketch after this list).


  • If you plug a different distance function into k-means, it may stop converging. You need to minimize the same criterion in both steps, and the second step recomputes the means: the arithmetic mean is the least-squares estimator of the center, so it minimizes variance. Since both steps minimize the same variance criterion, k-means must converge. If you want convergence with other distances, use PAM (Partitioning Around Medoids): the medoid minimizes the within-cluster distances for arbitrary distance functions (see the second sketch below).
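
To illustrate the first point, here is a minimal sketch of Lloyd iterations on made-up data; the assignment step uses only squared distances and never takes a square root, yet produces exactly the same labels:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # made-up data
centers = X[rng.choice(len(X), 3, replace=False)]   # k = 3 initial means

for _ in range(10):
    # Assignment step: squared Euclidean distances suffice (sqrt is monotone)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Update step: the arithmetic mean is the least-squares center
    # (a real implementation would also handle empty clusters)
    centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(3)])
```

And for the second point, a sketch of the medoid idea: for an arbitrary distance, the cluster "center" is the member of the cluster minimizing the total distance to all other members (Manhattan distance is just an arbitrary example here):

```python
import numpy as np

def medoid(cluster, dist):
    """Return the member of `cluster` minimizing total distance to the rest."""
    costs = [sum(dist(p, q) for q in cluster) for p in cluster]
    return cluster[int(np.argmin(costs))]

manhattan = lambda p, q: np.abs(p - q).sum()

rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 2))   # made-up cluster members
center = medoid(cluster, manhattan)  # always an actual data point
```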

But in the end, k-means and all of its variations are IMHO more of an optimization algorithm (or more precisely, a vector quantization algorithm) than an actual cluster analysis algorithm. They will not truly "discover" structure; they will massage your data into k partitions. If you give them uniform data, with no structure beyond randomness at all, k-means will still find however many "clusters" you ask for. k-means is perfectly happy returning results that are essentially random.
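
A quick way to see this last point for yourself: feed k-means structureless uniform noise and it will still report whatever k you request (a sketch using scikit-learn; the data is deliberately pure noise):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))  # pure noise, no cluster structure

for k in (2, 5, 10):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    print(k, np.bincount(km.labels_))  # k non-empty "clusters" every time
```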