Solved – Similarity measure between two text documents


Let the first document consist of the words $\{x_1,x_2,\ldots,x_n\}$ and the second of $\{y_1,y_2,\ldots,y_m\}$, where $n$ is not necessarily equal to $m$. I have a similarity measure that works for individual elements, $s(x_i,y_j)$, for all $i$ and $j$. This similarity may not be symmetric, that is, $s(x_i,y_j)\neq s(y_j,x_i)$ in general.

But I'm not sure how to measure the similarity of two sets of instances. I thought that averaging the best matches would be reasonable:

$$s\left(\{x_1,x_2,\ldots,x_n\},\{y_1,y_2,\ldots,y_m\}\right) = \frac{1}{n}\sum_{i=1}^n \max_{j=1}^m s(x_i,y_j)$$
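In code, the measure I have in mind would look something like this (a minimal sketch; the element similarity `s` is assumed to be given):

```python
def set_similarity(xs, ys, s):
    """Average, over the elements of xs, of the best match found in ys,
    under a (possibly asymmetric) element similarity s(x, y)."""
    return sum(max(s(x, y) for y in ys) for x in xs) / len(xs)

# Toy check with a made-up element similarity:
s = lambda a, b: 1.0 if a == b else 0.0
print(set_similarity(["cat", "mat"], ["cat", "rug", "sat"], s))  # 0.5
```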

What are the main methods used to compute $s\left(\{x_1,x_2,\ldots,x_n\},\{y_1,y_2,\ldots,y_m\}\right)$ in such a case? In particular, what is used for the similarity of two text documents?

Best Answer

There are various ways to define document similarity, but let me introduce the easiest approach to start with, based on a semantic vector space:

  1. First build your term-document matrix
  2. Then normalize the entries in the matrix with tf-idf
  3. From there, use the columns of the matrix as document vectors and compute their similarity, with cosine similarity for instance

I think it's the most basic approach that gives decent results.
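To make those three steps concrete, here is a minimal sketch with scikit-learn (the library choice and the toy documents are just illustrative; note that scikit-learn builds the matrix as documents-by-terms, i.e. one row per document rather than one column):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock prices fell sharply today",
]

# Steps 1-2: build the term-document matrix and weight its entries with tf-idf
X = TfidfVectorizer().fit_transform(docs)

# Step 3: cosine similarity between every pair of document vectors
sim = cosine_similarity(X)
print(sim.round(2))   # sim[i, j] is the similarity between documents i and j
```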

If you have a lot of documents and terms, your matrix can be very big but also very sparse. That's where dimensionality reduction techniques usually come in. You can, for instance, use SVD to find the underlying correlated dimensions of your space and use only the few strongest ones as a new basis for your document vectors. It works pretty well but is demanding in computing resources for very large spaces.
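If you want to try that, here is a rough sketch with scikit-learn's TruncatedSVD, continuing from the tf-idf matrix X above (this is essentially LSA; the number of dimensions is just an illustrative value for the toy corpus):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# X is the tf-idf matrix from the previous snippet.
# Keep only the few strongest latent dimensions; with a real corpus you
# would typically keep a few hundred, not 2.
X_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

sim_lsa = cosine_similarity(X_lsa)   # document similarity in the reduced space
```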

Alternatively, you can use random projection to reduce your space, but it's a bit longer to explain.
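For completeness, a similarly rough sketch with scikit-learn's SparseRandomProjection on the same matrix X (the target dimensionality is again only illustrative; with a realistic vocabulary you would project down to a few hundred dimensions):

```python
from sklearn.random_projection import SparseRandomProjection
from sklearn.metrics.pairwise import cosine_similarity

# Project the tf-idf vectors onto a small random subspace; pairwise
# angles and distances are approximately preserved.
X_rp = SparseRandomProjection(n_components=2, random_state=0).fit_transform(X)

sim_rp = cosine_similarity(X_rp)
```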

Note that there are also libraries that can do this for you. If you want to use random projection, for instance, the semantic vectors package was built around this idea: http://code.google.com/p/semanticvectors/. If I remember correctly, they also have SVD implemented (edit: I checked, and they have a Latent Semantic Analysis module; LSA is based on SVD).

You can look up all the terms used here in their corresponding Wikipedia entries; apparently I wasn't allowed to include more than two links for spam reasons.