Clustering Techniques – When to Combine Dimensionality Reduction (PCA, SVD) with Clustering

clustering, dimensionality reduction, pca, svd, text mining

I am trying to perform document-level clustering. I constructed the term-document frequency matrix and am trying to cluster these high-dimensional vectors with k-means. Instead of clustering directly, I first applied LSA's (Latent Semantic Analysis) singular value decomposition to obtain the U, S, Vt matrices, chose a suitable rank cutoff using the scree plot, and then clustered the reduced matrices (specifically Vt, because it gives me concept-document information). This seemed to give good results.
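Concretely, the procedure I followed looks roughly like this (a minimal sketch using scikit-learn; the corpus, component count, and cluster count here are placeholders, not my actual data or scree-plot values):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Hypothetical stand-in corpus.
docs = ["first example document", "second example document", "another text"]

# Term-document frequency matrix (scikit-learn puts documents in rows).
X = CountVectorizer().fit_transform(docs)

# Truncated SVD is LSA; fit_transform returns U * S, i.e. each document's
# coordinates in concept space -- the row-oriented analogue of the
# concept-document information carried by the columns of Vt.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)

# k-means on the reduced document vectors.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels)
```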

I've heard some people say that SVD (singular value decomposition) is itself clustering (e.g., via a cosine similarity measure), and I was not sure whether I could apply k-means to the output of SVD. It seemed logically sound to me: SVD is a dimensionality reduction technique and gives me a set of new vectors; k-means, in turn, takes the number of clusters as input and partitions those vectors into that many clusters. Is this procedure flawed, and are there ways it can be improved? Any suggestions?

Best Answer

This is by no means a complete answer. The question you should be asking is: "what kind of distances are preserved by the dimensionality reduction?" Since clustering algorithms such as k-means operate only on distances, the right distance metric to use (theoretically) is the one preserved by the dimensionality reduction. That way, the dimensionality reduction step can be seen as a computational shortcut for clustering the data in a lower-dimensional space (and it also helps avoid local minima, etc.).
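To make this concrete, here is a small numeric sketch (synthetic low-rank data, not from the question) showing that pairwise Euclidean distances between rank-k SVD projections closely approximate the original pairwise distances, which is what justifies running Euclidean k-means in the reduced space:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Approximately rank-5 data plus a little noise, standing in for a
# term-document matrix with documents in rows.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 100)) \
    + 0.01 * rng.normal(size=(200, 100))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
X_k = U[:, :k] * s[:k]          # each row in the k-dimensional SVD space

orig = pdist(X)                 # pairwise distances in the full space
reduced = pdist(X_k)            # pairwise distances in the reduced space
print(np.max(np.abs(orig - reduced)))  # tiny: distances nearly preserved
```

If the data truly had rank k, the distances would match exactly; with noise, the truncated SVD gives the best rank-k approximation in the least-squares sense, so the distortion is as small as a k-dimensional representation allows.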

There are many subtleties here which I will not pretend to understand (local vs. global distances, how relative distances get distorted, etc.), but I think this is the right direction in which to think about these things theoretically.
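As one practical corollary of the "which distance is preserved?" view: if cosine similarity is the metric of interest (as the question hints), a standard trick is to L2-normalize the reduced vectors before k-means, since for unit vectors the squared Euclidean distance equals 2 − 2·(cosine similarity), so plain Euclidean k-means then effectively clusters by cosine distance. A minimal sketch (random data standing in for the LSA document vectors):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_reduced = rng.normal(size=(100, 10))  # stand-in for reduced LSA vectors

X_unit = normalize(X_reduced)           # rows now have unit L2 norm
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_unit)

# Sanity check of the identity relating the two metrics for unit vectors:
a, b = X_unit[0], X_unit[1]
assert np.isclose(np.sum((a - b) ** 2), 2 - 2 * np.dot(a, b))
```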