I'm reading a book "Machine learning with Spark" by Nick Pentreath, and at page 224-225 the author discusses about using K-means as a form of dimensionality reduction.
I have never seen this kind of dimensionality reduction, does it has a name or/and is useful for specific shapes of data?
I quote the book describing the algorithm:
Assume that we cluster our high-dimensional feature vectors using
a K-means clustering model, with k clusters. The result is a set of k
cluster centers.We can represent each of our original data points in terms of how far it is
from each of these cluster centers. That is, we can compute the distance of
a data point to each cluster center. The result is a set of k distances for each
data point.These k distances can form a new vector of dimension k. We can now
represent our original data as a new vector of lower dimension, relative to
the original feature dimension.
Author suggests a Gaussian distance.
With 2 clusters for 2 dimensional data, I have the following:
K-means:
Applying the algorithm with norm 2:
Applying the algorithm with a Gaussian distance (applying dnorm(abs(z)):
R code for the previous pictures:
set.seed(1)
N1 = 1000
N2 = 500
z1 = rnorm(N1) + 1i * rnorm(N1)
z2 = rnorm(N2, 2, 0.5) + 1i * rnorm(N2, 2, 2)
z = c(z1, z2)
cl = kmeans(cbind(Re(z), Im(z)), centers = 2)
plot(z, col = cl$cluster)
z_center = function(k, cl) {
return(cl$centers[k,1] + 1i * cl$centers[k,2])
}
xlab = "distance to cluster center 1"
ylab = "distance to cluster center 2"
out_dist = cbind(abs(z - z_center(1, cl)), abs(z - z_center(2, cl)))
plot(out_dist, col = cl$cluster, xlab = xlab, ylab = ylab)
abline(a=0, b=1, col = "blue")
out_dist = cbind(dnorm(abs(z - z_center(1, cl))), dnorm(abs(z - z_center(2, cl))))
plot(out_dist, col = cl$cluster, xlab = xlab, ylab = ylab)
abline(a=0, b=1, col = "blue")
Best Answer
I think this is the "centroid method" (or the closely-related "centroidQR" method) described by Park, Jeon and Rosen. From Moon-Gu Jeon's thesis abstract:
It also seems to be equivalent to the "multiple group" method from Factor Analysis.