[Math] How close are two messages – the "Document Distance" algorithm

linear-algebra, matrices, vectors

I was looking at an algorithm that computes how close two texts are to one another, and the formula seems a bit weird to me.

The basic steps are:

  1. For each word encountered in a text, let a vector hold its frequency. For $n$ different words, the vectors lie in $\mathbb{R}^n$. You can assume that row $i$ of vector $1$ and row $i$ of vector $2$ refer to the same word. For example, row $0$ of vector $1$ has the value $4$ and row $0$ of vector $2$ has the value $3$, both referring to the word Dog: one text has $4$ occurrences of the word Dog while the other has only $3$. If a word appears in one text but not in the other, then the vector corresponding to the text without that word has a $0$ in that row.

  2. Compute the distance between the two vectors using the formula:
    $$\arccos\left(\frac{L1\cdot L2}{\sqrt{(L1\cdot L1)(L2\cdot L2)}}\right)\tag1$$

Why use the above equation when we could use Euclidean distance instead?
$$d(p,q) = \sqrt{(p_1-q_1)^2 + (p_2-q_2)^2 + \cdots + (p_i-q_i)^2 + \cdots + (p_n-q_n)^2}.$$

It seems like a simpler way to find the distance between two vectors. I don't understand equation $(1)$ very well, so I am not sure whether it is more appropriate for measuring how close two texts are.
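For concreteness, the Euclidean formula above is straightforward to compute; a minimal sketch (the function name is my own):

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two frequency vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

With the Dog example above, two vectors $[4]$ and $[3]$ are at Euclidean distance $1$.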

Best Answer

The short answer is:

  • The first distance is just the angle between the two vectors representing the two documents.
  • As to why one should prefer the angle over the actual distance between the word vectors: this is just a choice of model; people may have empirically observed that this metric is more meaningful. In fact, if you look at the normalized frequencies, i.e., if you normalize the representation vector by the length of the document, the two distances are equivalent.

And longer...

As you described, here each document $j$ is represented by a vector $\mathbf{x}^{(j)}$, where $x_{i}^{(j)}$ holds the number of times word $i$ appears in document $j$. Note that this is already a choice of model: the same way you wonder about why prefer one distance over the other, you could wonder why this is a good representation for documents. The answer, I guess, is that it is just a simple model that works for now (or at least for a start).

Now you need to define a distance between vectors, because you need a quantitative way to describe the similarity of documents. One observation that might make intuitive sense (it is again part of the model that one chooses based on observation) is that the length of the document should not matter:

  • We implicitly assume that the topic of a document is characterized by the relative frequencies of words. If a document has exactly the same words as another document, except that each word appears twice as often in the second document, we still believe that the two documents are as similar as it gets.

Hence, it makes sense to "normalize" the vector describing a document: given $\mathbf{x}^{(j)}$, the vector describing document $j$, we will actually represent the document by $$ \mathbf{y}^{(j)} = \frac{\mathbf{x}^{(j)}}{\|\mathbf{x}^{(j)}\|} = \frac{\mathbf{x}^{(j)}}{\sqrt{{\mathbf{x}^{(j)}}^{T}{\mathbf{x}^{(j)}}}} $$
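This normalization can be sketched in one small function (the name is my own):

```python
import math

def normalize(x):
    """Scale a frequency vector to unit Euclidean length: y = x / ||x||."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]
```

For example, $[3, 4]$ normalizes to $[0.6, 0.8]$, a vector of length $1$.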

So now you represent document $j$ by a unit length vector $\mathbf{y}^{(j)}$. You could choose to measure the distance between two documents $j$ and $l$ either by the angle between their vectors $\mathbf{y}^{(j)}$ and $\mathbf{y}^{(l)}$, which is given by $$ \theta_{jl} = \arccos (\mathbf{y}^{(j)} \cdot \mathbf{y}^{(l)} ) = \arccos \left(\frac{\mathbf{x}^{(j)}}{\sqrt{{\mathbf{x}^{(j)}}^{T}{\mathbf{x}^{(j)}}}} \cdot \frac{\mathbf{x}^{(l)}}{\sqrt{{\mathbf{x}^{(l)}}^{T}{\mathbf{x}^{(l)}}}}\right), $$ as described, or by the Euclidean distance, as you suggested. They are not that different, are they? Given the angle between two unit-norm vectors, you can recover the Euclidean distance, and vice versa.
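The last remark can be made precise: for unit vectors, $\|\mathbf{y}^{(j)} - \mathbf{y}^{(l)}\|^2 = 2 - 2\,\mathbf{y}^{(j)}\cdot\mathbf{y}^{(l)} = 2 - 2\cos\theta_{jl}$, so each measure determines the other. A quick numerical check (the function is my own illustration):

```python
import math

def check_angle_distance_relation(theta):
    """For two unit vectors at angle theta, the Euclidean distance
    should equal sqrt(2 - 2*cos(theta))."""
    y1 = (1.0, 0.0)                          # unit vector along the x-axis
    y2 = (math.cos(theta), math.sin(theta))  # unit vector at angle theta to it
    d = math.sqrt((y1[0] - y2[0]) ** 2 + (y1[1] - y2[1]) ** 2)
    return d, math.sqrt(2 - 2 * math.cos(theta))
```

At $\theta = \pi/3$, both expressions give a chord length of exactly $1$.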

Finally, one last remark. Instead of going all the way to computing the actual angle $\theta_{jl}$, you could just compute $\cos \theta_{jl} = \mathbf{y}^{(j)} \cdot \mathbf{y}^{(l)} $. Since the vectors $\mathbf{y}^{(j)}$ are all nonnegative by construction, you know that $\cos \theta_{jl}$ takes values between $0$ and $1$. Hence something like $1-\cos \theta_{jl}$ could be used as a "dissimilarity" metric between documents $j$ and $l$: it is $0$ for documents pointing in the same direction and grows as they diverge. Of course, you could do something like that with the Euclidean distance as well. But this is just a choice of model...
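This dissimilarity is cheap to compute once the vectors are normalized; a sketch (the function name is my own):

```python
def cosine_dissimilarity(y1, y2):
    """1 - cos(theta) for already unit-normalized vectors.

    Returns 0 for vectors pointing in the same direction and 1 for
    orthogonal (no shared words) nonnegative vectors, avoiding the
    arccos call entirely.
    """
    return 1 - sum(a * b for a, b in zip(y1, y2))
```

Note that this skips `arccos`, which is a monotone function, so rankings of documents by similarity are unchanged.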
