Solved – Clustering as dimensionality reduction

clustering, dimensionality reduction, k-means

I'm reading the book "Machine Learning with Spark" by Nick Pentreath, and on pages 224-225 the author discusses using K-means as a form of dimensionality reduction.

I have never seen this kind of dimensionality reduction before. Does it have a name, and is it useful for specific shapes of data?

Quoting the book's description of the algorithm:

Assume that we cluster our high-dimensional feature vectors using a K-means clustering model, with k clusters. The result is a set of k cluster centers.

We can represent each of our original data points in terms of how far it is from each of these cluster centers. That is, we can compute the distance of a data point to each cluster center. The result is a set of k distances for each data point.

These k distances can form a new vector of dimension k. We can now represent our original data as a new vector of lower dimension, relative to the original feature dimension.

The author suggests using a Gaussian distance.
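In general, the transformation looks something like this in R (my own sketch, not taken from the book; dist_to_centers is just a name I made up, and I assume plain Euclidean distances with dnorm() as the optional Gaussian transform):

dist_to_centers = function(X, centers) {
  # Distance from every row of X to every row of centers (n x k result)
  apply(centers, 1, function(mu) sqrt(rowSums(sweep(X, 2, mu)^2)))
}

X = matrix(rnorm(200 * 5), ncol = 5)   # 200 points in 5 dimensions
cl = kmeans(X, centers = 3)
D = dist_to_centers(X, cl$centers)     # 200 x 3: the reduced representation
G = dnorm(D)                           # optional Gaussian transform of the distances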

With 2 clusters on 2-dimensional data, I get the following:

K-means:

[Figure: K-means with 2 clusters]

Applying the algorithm with the Euclidean (L2) norm:

[Figure: distances to the two cluster centers]

Applying the algorithm with a Gaussian distance (applying dnorm(abs(z))):

[Figure: Gaussian-transformed distances]

R code for the previous pictures:

set.seed(1)

# Two Gaussian clusters, encoded as complex numbers (Re = x, Im = y)
N1 = 1000
N2 = 500
z1 = rnorm(N1) + 1i * rnorm(N1)
z2 = rnorm(N2, 2, 0.5) + 1i * rnorm(N2, 2, 2)
z = c(z1, z2)

# K-means on the (x, y) coordinates
cl = kmeans(cbind(Re(z), Im(z)), centers = 2)

# Original data, colored by cluster assignment
plot(z, col = cl$cluster)

# Center of cluster k as a complex number
z_center = function(k, cl) {
  cl$centers[k, 1] + 1i * cl$centers[k, 2]
}

xlab = "distance to cluster center 1"
ylab = "distance to cluster center 2"

# New representation: Euclidean distance to each of the two centers
out_dist = cbind(abs(z - z_center(1, cl)), abs(z - z_center(2, cl)))
plot(out_dist, col = cl$cluster, xlab = xlab, ylab = ylab)
abline(a = 0, b = 1, col = "blue")

# Same, after the Gaussian transform dnorm() of the distances
out_dist = cbind(dnorm(abs(z - z_center(1, cl))), dnorm(abs(z - z_center(2, cl))))
plot(out_dist, col = cl$cluster, xlab = xlab, ylab = ylab)
abline(a = 0, b = 1, col = "blue")

Best Answer

I think this is the "centroid method" (or the closely related "CentroidQR" method) described by Park, Jeon, and Rosen. From Moon-Gu Jeon's thesis abstract:

Our Centroid method projects full dimensional data onto the centroid space of its classes, which gives tremendous dimensional reduction, reducing the number of dimensions to the number of classes while improving the original class structure. One of its interesting properties is that even when using two different similarity measures, the results of classification for the full and the reduced dimensional space formed by the Centroid are identical when the centroid-based classification is applied. The second method, called CentroidQR, is a variant of our Centroid method, which uses, as a projection space, k columns of the orthogonal matrix Q from the QR decomposition of the centroid matrix.

It also seems to be equivalent to the "multiple group" method from Factor Analysis.
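For illustration, here is a rough sketch of how I read the CentroidQR idea from that abstract (my own code, not Park et al.'s; centroid_qr_project is a name I made up): stack the k class centroids as the columns of a d x k matrix, take Q from its QR decomposition, and project the data onto those k orthonormal columns.

centroid_qr_project = function(X, labels) {
  C = sapply(split(as.data.frame(X), labels), colMeans)  # d x k centroid matrix
  Q = qr.Q(qr(C))                                        # d x k orthonormal basis
  X %*% Q                                                # n x k projected data
}

X = matrix(rnorm(200 * 5), ncol = 5)
labels = sample(1:3, 200, replace = TRUE)
X_reduced = centroid_qr_project(X, labels)               # 200 x 3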