Solved – LSA vs. PCA (document clustering)

clustering, data-mining, latent-semantic-analysis, pca, svd

I'm investigating various techniques used in document clustering, and I would like to clear up some doubts concerning PCA (principal component analysis) and LSA (latent semantic analysis).

First – what are the differences between them? I know that in PCA, the SVD is applied to the term-covariance matrix, while in LSA it is applied to the term-document matrix. Is there anything else?

Second – what is their role in the document clustering procedure? From what I have read so far, I deduce that their purpose is dimensionality reduction, noise reduction, and incorporating relations between terms into the representation. After applying PCA or LSA, traditional algorithms like k-means or agglomerative methods are run in the reduced space, using typical similarity measures such as cosine distance. Please correct me if I'm wrong.
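For concreteness, here is a minimal sketch of that pipeline using scikit-learn; the corpus, the number of components, and the number of clusters below are made-up placeholders. TfidfVectorizer L2-normalizes each document by default, TruncatedSVD is scikit-learn's LSA-style dimensionality reduction, and the Normalizer step re-normalizes the reduced vectors:

```python
# Minimal LSA + k-means sketch; corpus and hyperparameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply",
        "investors sold shares in the market"]

vectorizer = TfidfVectorizer()       # L2-normalized TF-IDF rows by default
X = vectorizer.fit_transform(docs)   # documents x terms, sparse

lsa = make_pipeline(TruncatedSVD(n_components=2),  # LSA: truncated SVD
                    Normalizer(copy=False))        # re-normalize rows
X_reduced = lsa.fit_transform(X)

# With unit-length rows, Euclidean k-means behaves like clustering
# by cosine similarity (spherical k-means).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_reduced)
print(km.labels_)
```

The final normalization step matters because, for unit vectors, squared Euclidean distance equals 2 minus twice the cosine similarity, so ordinary k-means then effectively clusters by cosine.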

Third – does it matter whether the TF-IDF term vectors are normalized before applying PCA/LSA? And should they be normalized again afterwards?
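One way to see why the answer can depend on the setup (continuing the hypothetical snippet above, where `X` is the L2-normalized TF-IDF matrix): even though the TF-IDF rows start with unit length, their projections after the SVD do not, so a Euclidean k-means would implicitly weight documents unevenly unless you re-normalize:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# X is the (L2-normalized) TF-IDF matrix from the sketch above.
Z = TruncatedSVD(n_components=2).fit_transform(X)
print(np.linalg.norm(Z, axis=1))  # row norms now differ across documents
Zn = normalize(Z)                 # unit rows again: Euclidean k-means ~ cosine
```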

Fourth – let's say I have performed some clustering in the space reduced by LSA/PCA. Now, how should I assign labels to the resulting clusters? Since the dimensions don't correspond to actual words, this is a rather difficult issue. The only idea that comes to my mind is computing the centroid of each cluster using the original term vectors and selecting the terms with the top weights, but that doesn't sound very efficient. Are there specific solutions for this problem? I wasn't able to find anything.
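For what it's worth, that centroid idea is a common heuristic and is cheap in practice, since it only touches one centroid vector per cluster. A sketch, continuing the hypothetical pipeline above (so `lsa`, `vectorizer`, and `km` are the fitted objects from that snippet): map each centroid back into the original term space and read off the highest-weighted terms.

```python
import numpy as np

# Map the k-means centroids (in LSA space) back to the original term
# space, then label each cluster with its top-weighted terms.
svd = lsa.named_steps["truncatedsvd"]
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
terms = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0

for cluster_id, centroid in enumerate(original_space_centroids):
    top = np.argsort(centroid)[::-1][:5]  # indices of the 5 largest weights
    print(cluster_id, [terms[i] for i in top])
```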

I would be very grateful for any clarification of these issues.

Best Answer

  1. PCA and LSA are both analyses that use SVD. PCA is a general class of analysis and could in principle be applied to enumerated text corpora in a variety of ways. In contrast, LSA is a very clearly specified way of analyzing and reducing text. Both leverage the idea that meaning can be extracted from context. In LSA the context is provided by a term-document matrix. In the PCA you proposed, the context is provided by a term-covariance matrix (the details of how that matrix is generated can probably tell you a lot more about the relationship between your PCA and LSA); a small numerical sketch of the relationship follows this list. You may want to look here for more details.
  2. You are basically on track here. The exact reasons they are used will depend on the context and the aims of the person playing with the data.
  3. The answer will probably depend on the implementation of the procedure you are using.
  4. Carefully and with great art. Most consider the dimensions of these semantic models to be uninterpretable. Note that you almost certainly expect there to be more than one underlying dimension. When there is more than one dimension in factor analysis, we rotate the factor solution to yield interpretable factors; for some reason, however, this is not typically done for these models. Your approach sounds like a principled way to start your art... although I'd be less than certain that the scaling between dimensions is similar enough to trust a cluster analysis solution. If you want to play around with meaning, you might also consider a simpler approach in which the vectors have a direct relationship with specific words, e.g. HAL.
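To make the relationship in point 1 concrete, here is a small NumPy sketch with a random matrix standing in for a real term-document matrix. It shows that LSA takes the SVD of the raw matrix, while PCA amounts to the SVD of the row-centered matrix, which is equivalent to eigendecomposing the term-covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))  # toy term-document matrix: 6 terms x 4 documents

# LSA: truncated SVD of the raw term-document matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
lsa_docs = (np.diag(s[:k]) @ Vt[:k]).T
print(lsa_docs.shape)  # (4, 2): each document as a k-dimensional vector

# PCA: center each term (row) across documents, then take the SVD.
Ac = A - A.mean(axis=1, keepdims=True)
Uc, sc, Vct = np.linalg.svd(Ac, full_matrices=False)

# The squared singular values of the centered matrix, divided by n - 1,
# are exactly the eigenvalues of the term-covariance matrix C.
n = A.shape[1]
C = Ac @ Ac.T / (n - 1)
eigvals = np.linalg.eigvalsh(C)[::-1]  # descending order
print(np.allclose(sc**2 / (n - 1), eigvals[:len(sc)]))  # True
```

In this framing, the main structural difference between the two is the centering step; the rest is the same SVD machinery (plus whatever weighting scheme, such as TF-IDF, was used to build the matrix in the first place).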