Latent Semantic Analysis – Computing Document Similarity Using Latent Semantic Analysis

clustering, data mining, latent-semantic-analysis

I have a question regarding Latent Semantic Analysis: after performing SVD on the term-document matrix and choosing some number of dimensions, I get a set of new document vectors.

Now, how can I calculate similarity between two documents? New document vectors contain negative values, and results produced by cosine similarity make no sense.

Best Answer

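It is normal for the new document vectors to contain negative values. The new dimensions correspond to latent concepts (not directly interpretable) in the lower-dimensional space. The sign of each SVD dimension is arbitrary, so a negative value only means that the document lies on the opposite side of that concept's axis from documents with positive values; it does not make the vector invalid.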

What do you mean by "results produced by cosine similarity make no sense"? Cosine similarity should work fine: it ranges from -1 to 1 and handles negative components without any special treatment. You can also try Pearson correlation (a mean-centered version of cosine).
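To make the point about negative values concrete, here is a minimal sketch of cosine similarity and Pearson correlation in plain Python. The three document vectors are made-up values standing in for real LSA output, and returning 0.0 for an all-zero vector is just a convention chosen for this example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def pearson(u, v):
    """Pearson correlation: cosine after subtracting each vector's own mean."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

# Hypothetical 3-dimensional LSA document vectors (made-up values).
doc_a = [0.8, -0.3, 0.1]
doc_b = [0.7, -0.4, 0.2]
doc_c = [-0.6, 0.5, 0.0]

sim_ab = cosine(doc_a, doc_b)  # vectors point the same way: close to 1
sim_ac = cosine(doc_a, doc_c)  # vectors point in opposite directions: negative
print(sim_ab, sim_ac, pearson(doc_a, doc_b))
```

A negative cosine here simply means the two documents point in roughly opposite directions in the latent space, which is a legitimate "dissimilar" result rather than an error.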

Related Solutions

Clustering Techniques – When to Combine Dimensionality Reduction with Clustering in PCA and SVD

This is by no means a complete answer. The question you should be asking is "what kind of distances are preserved when doing dimensionality reduction?". Since clustering algorithms such as K-means operate only on distances, the right distance metric to use (theoretically) is the one that is preserved by the dimensionality reduction. This way, the dimensionality reduction step can be seen as a computational shortcut for clustering the data in a lower-dimensional space (and also as a way to avoid local minima, etc.).

There are many subtleties here which I will not pretend to understand (local distances vs. global distances, how relative distances are distorted, etc.), but I think this is the right direction to think about these things theoretically.
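As a rough illustration of the "which distances are preserved" question, the sketch below uses NumPy on random data (an assumption; nothing here comes from a real dataset). It compares pairwise Euclidean distances before and after projecting onto principal directions obtained from an SVD of the centered data. Keeping every component is a pure rotation and leaves distances unchanged, while keeping only some components can shrink distances but never enlarge them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # 6 made-up points in 4 dimensions
Xc = X - X.mean(axis=0)       # center the data, as PCA does

# Rows of Vt are the principal directions of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

def pairwise(M):
    """Matrix of Euclidean distances between all rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d_orig = pairwise(Xc)
d_full = pairwise(Xc @ Vt.T)      # all 4 components: a rotation
d_low = pairwise(Xc @ Vt[:2].T)   # first 2 components: a projection

print(np.allclose(d_orig, d_full))      # rotation preserves distances
print(bool(np.all(d_low <= d_orig + 1e-9)))  # projection never increases them
```

How much the distances shrink, and whether they shrink evenly across pairs, is exactly the kind of subtlety mentioned above; this sketch only shows the basic Euclidean behaviour.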

Solved – Can LSA be used for document similarity

In general, LSA is meaningful for computing document similarity. However, you need a large collection of documents (more than 100,000) because LSA is based on finding associations between words (e.g. it will find that dog and cat are similar words, and therefore that a document about dogs is similar to a document about cats). If your collection is small, no meaningful associations between words can be derived. LSA is just a change of representation, and to compute similarity you will still use the cosine on the LSA representation. Originally each document is a sparse vector of dimension e.g. 100,000, but after LSA it is a dense vector of dimension e.g. 200.

As you said, you can already do cosine similarity on the sparse data (just transformed word counts). Hopefully you have already applied stop-wording, stemming and tf-idf normalization. It's useful to know what these transformations achieve because LSA is just another transformation on top of those standard transformations. I'll briefly go over the usefulness of those transformations before I describe what LSA does.

- stop wording. Document content is dominated by stop words. Those words are everywhere, and if you compute similarity without removing them, a large portion of the similarity score will be due to stop words. This adds noise and makes the similarity less precise.
- stemming. Take for example words such as dog and dogs. If you don't do stemming, you are missing a chance to improve the similarity because dog and dogs are clearly related.
- tf-idf normalization. Here the issue is that some words, often stop words but also borderline stop words (e.g. "reach", "achieve"), are going to have high counts. Without down-weighting, the similarity will be dominated by those counts.
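The three preprocessing steps above can be sketched in a few lines of plain Python. Everything in this example is a simplification for illustration: the stop-word list is tiny, `crude_stem` is a hypothetical stand-in for a real stemmer (it only strips a trailing "s"), and the tf-idf weight is the basic count times log(N / document frequency) variant, one of several in common use.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "of", "to"}  # tiny illustrative list

def crude_stem(word):
    """Rough stand-in for a real stemmer: strip a trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, stem what remains."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weighted = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weighted.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weighted

docs = [
    "the dog chases the cat",
    "dogs and cats reach the garden",
    "tomatoes reach the market",
]
weights = tfidf(docs)
print(weights[0])
```

In this toy corpus "dog" and "dogs" end up as the same term, "the" disappears entirely, and a term that occurs in only one document ("chase") gets a higher weight than one shared by two documents ("dog").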

If you did all those, you already have a very strong similarity.

One more transformation you can do is to consider related words: for example, dog, hound, animal, cat, etc. are related. Some of these relationships will be meaningful for your similarity score while others will not. One way to describe LSA is that it first derives such relationships by statistical co-occurrence analysis. For example, cats and dogs co-occur frequently, and therefore a document about dogs will be more similar to a document about cats than to a document about tomatoes. However, sometimes this may not be what you want: Austria and Germany are similar, so if you search for Germany you can get documents about Austria. In general, LSA will make sense if you want to compare documents based on their topics.

What LSA does is to "enrich" or "expand" each document with related words. One way to achieve this is to just use the LSA representation. The LSA representation of a document is just the sum of the LSA representation of the individual words.
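Here is a small NumPy sketch of that idea on a toy term-document matrix (the corpus and the choice of k = 2 are made up for illustration). It assumes one common convention in which each term's LSA vector is its row of the truncated left singular matrix `U_k`, and a document is mapped into the latent space as `U_k.T @ counts`; other conventions additionally scale by the singular values. Under this convention the document vector is literally the count-weighted sum of its words' vectors.

```python
import numpy as np

terms = ["dog", "cat", "pet", "tomato", "soup"]
# Toy term-document count matrix (rows = terms, columns = documents).
X = np.array([
    [1, 0, 1, 0, 0],  # dog
    [0, 1, 1, 0, 0],  # cat
    [1, 1, 0, 0, 0],  # pet
    [0, 0, 0, 1, 1],  # tomato
    [0, 0, 0, 1, 2],  # soup
], dtype=float)

k = 2  # number of latent dimensions kept (arbitrary for this toy example)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]  # one k-dimensional LSA vector per term

def to_latent(counts):
    """Map a term-count vector into the k-dimensional latent space."""
    return U_k.T @ counts

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

dog_doc = np.array([1, 0, 0, 0, 0], dtype=float)
cat_doc = np.array([0, 1, 0, 0, 0], dtype=float)
soup_doc = np.array([0, 0, 0, 0, 1], dtype=float)

raw_sim = cosine(dog_doc, cat_doc)                        # no shared words
lsa_sim = cosine(to_latent(dog_doc), to_latent(cat_doc))  # related via co-occurrence
lsa_dog_soup = cosine(to_latent(dog_doc), to_latent(soup_doc))

# A document containing "dog" and "cat" is the sum of the two word vectors.
mixed_doc = dog_doc + cat_doc
sum_of_words = U_k[terms.index("dog")] + U_k[terms.index("cat")]
print(raw_sim, lsa_sim, lsa_dog_soup)
print(np.allclose(to_latent(mixed_doc), sum_of_words))
```

The "dog" and "cat" documents share no words, so their raw cosine is zero, yet they land on the same latent direction because both words co-occur with each other and with "pet" in the toy corpus. The "soup" document stays unrelated to both.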

Another, more controlled way to perform document "expansion" while still using LSA is to take the related words that LSA derives for each word in the document and to selectively "expand" the original document with more of those words.

LSA will not work equally well for all documents or all words. Documents whose topics dominate in the collection will likely be improved, while "outlier" documents will be mapped to a "noise" document. Rare words will not have meaningful similar words no matter if LSA is used or not.

Another drawback of LSA is that the representation after LSA is dense, while the original representation is sparse. This means that you will not be able to find the most similar documents quickly using a search engine. However, LSA can be easily applied in a rescoring phase after candidates are found with keyword search. LSA can also be used to derive informative words from a document.
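The rescoring idea can be sketched as a two-stage lookup in plain Python: a sparse inverted index retrieves candidates by keyword, and only those candidates are then ranked by cosine similarity on their dense LSA vectors. The document ids, tokens and 2-dimensional LSA vectors below are all made up, and `search` is a hypothetical helper, not part of any search engine's API.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical corpus: tokens for keyword lookup plus made-up dense LSA vectors.
docs = {
    "d1": {"tokens": {"dog", "leash"}, "lsa": [0.9, 0.1]},
    "d2": {"tokens": {"dog", "tomato"}, "lsa": [0.2, 0.9]},
    "d3": {"tokens": {"cat", "toy"}, "lsa": [0.8, 0.2]},
}

# Stage 0: build a sparse inverted index (token -> set of document ids).
index = {}
for doc_id, doc in docs.items():
    for token in doc["tokens"]:
        index.setdefault(token, set()).add(doc_id)

def search(query_tokens, query_lsa):
    """Keyword lookup for candidates, then rescoring by LSA cosine similarity."""
    candidates = set()
    for token in query_tokens:
        candidates |= index.get(token, set())
    return sorted(candidates,
                  key=lambda d: cosine(query_lsa, docs[d]["lsa"]),
                  reverse=True)

ranked = search({"dog"}, [1.0, 0.0])
print(ranked)
```

Note the trade-off this makes visible: "d3" is close to the query in the latent space but is never scored because it does not contain the keyword, so rescoring can only reorder what keyword search already found.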
