How is a component of the SVD of a term-document matrix $A$ equal to the eigenvector matrix of $AA^T$?

matrices, matrix decomposition, matrix equations, svd

I'm trying to get an intuitive grip on SVDs so I can explore LSI as part of NLP; I understand that LSI involves the use of SVDs. As I understand it, in SVD

$A = S\Sigma U^T$

I understand that $S$ is a matrix of eigenvectors of $A^TA$ and $\Sigma$ is a diagonal matrix of the square roots of its eigenvalues.

  1. I don't understand how $U$ is the matrix of eigenvectors of $AA^T$

In LSI, we decompose this SVD matrix to get a shorter vector for each term and document to find out the relative distance between those vectors and the centroid

  2. I don't understand how doing an SVD converts words to their own vectors, by which we can estimate the distance between documents

I'm very new to NLP and I'm finding it hard to get an intuitive sense of the math, so I'd sincerely appreciate any help. Are there any books I can refer to? Thanks!

Best Answer

I don't understand how $U$ is the matrix of eigenvectors of $AA^T$

Correction: $U$ is the matrix of eigenvectors of $AA^T$. If you want to understand this in general, read on; if your interest is only incidental to LSA, then skip to the next section.

In SVD, we decompose a matrix $X$ into three matrices, each with useful properties.

$X = USV^T$. Here $U,V$ are unitary matrices: $U^T = U^{-1}$ and same for $V$. $S$ is a diagonal matrix.

The only property we need is that $U,V$ are unitary matrices.

I'm assuming real matrices everywhere.
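Before the derivation, here is a quick sanity check of these properties with NumPy (just a sketch on a random matrix, chosen only for illustration; note that `np.linalg.svd` returns the singular values as a 1-D array and $V^T$ rather than $V$):

```python
import numpy as np

X = np.random.rand(5, 3)                         # a 5 x 3 real matrix
U, s, Vt = np.linalg.svd(X, full_matrices=True)  # U: 5x5, s: (3,), Vt: 3x3

# U and V are unitary (orthogonal, since everything is real): U^T = U^-1
print(np.allclose(U.T @ U, np.eye(5)))    # True
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # True

# S is a rectangular diagonal matrix built from the singular values
S = np.zeros((5, 3))
np.fill_diagonal(S, s)
print(np.allclose(X, U @ S @ Vt))         # True: X = U S V^T
```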

Let's start with $XX^T$:

(please note the dimensions at each step as indicated)

$XX^T = USV^T (USV^T)^T$

$XX^T = USV^T_{n \times n} V_{n \times n}S^TU^T$

Here we use $V^TV = I$:

$XX^T = USI_{n \times n} S^T_{n \times m}U^T$

$I_{n \times n} S^T_{n \times m} = S^T$. Notice that $S$ is a diagonal matrix, so $SS^T$ is also diagonal:

$XX^T = U SS^T U^T$

$XX^T = U SS^T U^{-1}$

This has exactly the form of an eigendecomposition: the columns of $U$ are eigenvectors of $XX^T$, and the diagonal entries of $SS^T$ (the squared singular values) are the corresponding eigenvalues.

You can follow a similar chain of reasoning with $X^TX$ to see that $V$ is the matrix of eigenvectors of $X^TX$.
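If a numerical check helps, here is a minimal NumPy sketch of both conclusions on a random matrix (the matrix is an arbitrary example; eigenvectors are only determined up to sign and scale, so the check verifies the eigenvalue equation column by column):

```python
import numpy as np

X = np.random.rand(5, 3)
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Each column of U is an eigenvector of X X^T with eigenvalue s_i^2
XXt = X @ X.T
for i, sigma in enumerate(s):
    print(np.allclose(XXt @ U[:, i], sigma**2 * U[:, i]))   # True

# Each column of V is an eigenvector of X^T X with the same eigenvalue
XtX = X.T @ X
V = Vt.T
for i, sigma in enumerate(s):
    print(np.allclose(XtX @ V[:, i], sigma**2 * V[:, i]))   # True
```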

Distance between documents

In LSI, we decompose this SVD matrix to get a shorter vector for each term and document to find out the relative distance between those vectors and the centroid

I think you mean "approximate" instead of "decompose."

I don't understand how doing an SVD converts words to their own vectors, by which we can estimate the distance between documents

SVD does not convert the words into vectors. The second part of your sentence is correct, though: we can estimate the distance between documents by using the SVD to approximate the term-document matrix.

This is my understanding: say you have a set of $n$ documents. To represent them mathematically, you strip out the common words such as the, a, here, there; in other words, words that you believe can appear in all these documents and carry no significance.

After this, you count the remaining words in the document and create a column vector, where each word has a score (the number of times it occurs in this document). So far you have represented $1$ document as a column vector.

Now you do this for all $n$ documents, stack the column vectors side by side (making sure each word gets the same row in every vector), and you have a matrix. Call this matrix $C$.
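As a concrete sketch of this construction, assuming a made-up toy corpus of three already-cleaned documents:

```python
import numpy as np

# Toy corpus of n = 3 documents, common words already stripped out
docs = [
    "cat sat mat cat",
    "dog sat log",
    "cat dog friends",
]

# One row per distinct remaining word
vocab = sorted({w for d in docs for w in d.split()})

# C[i, j] = number of times word i occurs in document j
C = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

print(vocab)   # row labels (words)
print(C)       # the terms x documents count matrix
```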

At this point what you have is a massive matrix $C$ which is humanly (and computationally) difficult to handle. This is where SVD is useful: it can decompose this matrix into three matrices, each of which has a useful property.

$C = USV^T$

Here $S$ is a diagonal matrix. The diagonal elements will have different values, but all positive in this case.

Now you can think about approximating $C$. Say you notice that most diagonal elements of $S$ are close to zero. You set them to zero. With this thresholding done, you can reconstruct a matrix $C'$ by multiplying the three matrices again.

Mathematically, this $C'$ is an approximation of $C$.
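Here is a minimal sketch of that thresholding step, assuming an invented toy count matrix and keeping $k = 2$ singular values:

```python
import numpy as np

# A small invented terms x documents count matrix
# (rows = words, columns = documents)
C = np.array([
    [2., 0., 1.],
    [0., 1., 1.],
    [1., 1., 0.],
    [1., 0., 0.],
    [0., 1., 0.],
])

U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Threshold: keep the k largest singular values, set the rest to zero
k = 2
s_thr = s.copy()
s_thr[k:] = 0.0

# Multiply the three matrices back together to get the approximation C'
C_approx = U @ np.diag(s_thr) @ Vt

print(np.round(C_approx, 2))
print("error ||C - C'|| =", np.linalg.norm(C - C_approx))
```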

More importantly, after we threshold the diagonal elements of $S$, several columns of $S$ will be entirely zero. This means the corresponding rows in $V^T$ are useless because they do not contribute to the approximated $C'$.

Let's drop those irrelevant rows from $V^T$ and call this new matrix $V_{truncated}^T$.

$V_{truncated}^T$ contains the same number of columns as $C$ but very few rows. This means we started with $n$ documents and maybe a million rows but now we have $n$ documents and only a handful of rows in $V^T_{truncated}$.

In fact, we could threshold the diagonal elements harshly, leaving only $2$ rows in our $V^T_{truncated}$.

The product $SV^T_{truncated}$ is a low-dimensional representation of our original documents. At this point we could easily draw the low-dimensional documents as 2D points on a piece of paper. Points that are close together represent similar documents, or as you put it, the "distance between documents" is small.
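Putting these last steps together, here is a sketch of the harsh truncation on the same invented toy matrix: each document becomes a 2-D point, and we can read off the distances between documents directly.

```python
import numpy as np

# Same invented terms x documents matrix as in the previous sketch
C = np.array([
    [2., 0., 1.],
    [0., 1., 1.],
    [1., 1., 0.],
    [1., 0., 0.],
    [0., 1., 0.],
])

U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Harsh truncation: keep only the top k = 2 singular values and
# drop the now-useless rows of V^T instead of carrying zeros around
k = 2
S_trunc = np.diag(s[:k])    # k x k
Vt_trunc = Vt[:k, :]        # k x n

# Each column of S V^T_truncated is a 2-D representation of one document
doc_coords = S_trunc @ Vt_trunc
print(np.round(doc_coords, 2))

# Distances between documents in the reduced space
n_docs = doc_coords.shape[1]
for i in range(n_docs):
    for j in range(i + 1, n_docs):
        d = np.linalg.norm(doc_coords[:, i] - doc_coords[:, j])
        print(f"distance(doc{i}, doc{j}) = {d:.2f}")
```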

Some more material

Some intuitive answers here. After that, this and this explain what approximating the SVD achieves in this application, i.e. LSI.