Suppose we have a bunch of large vectors $x_1,\ldots,x_N$ stored as the columns of a matrix $X$. It would be nice if we could somehow find a small number of vectors $u_1,\ldots,u_s$ such that each vector $x_i$ is (to a good approximation) equal to a linear combination of the vectors $u_1,\ldots, u_s$. This would allow us to describe each of the (very large) vectors $x_i$ using just a small number of coefficients.
So we want to find vectors $u_1,\ldots, u_s$ such that for each $x_i$ we have
\begin{equation}
x_i \approx c_{i,1} u_1 + c_{i,2} u_2 + \cdots + c_{i,s} u_s
\end{equation}
for some coefficients $c_{i,1},\ldots, c_{i,s}$.
These $N$ equations (as $i$ goes from $1$ to $N$) can be combined into a single matrix equation:
\begin{equation}
X \approx U C
\end{equation}
for some matrix $C$. (Here the columns of $U$ are $u_1,\ldots, u_s$.)
Note that the rank of $UC$ is less than or equal to $s$. So $UC$ is a low rank approximation of $X$.
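To get a sense of the saving (with made-up sizes): if each $x_i$ has $m = 10^6$ entries and we take $N = 1000$ and $s = 20$, then $X$ has $mN = 10^9$ entries, while $U$ and $C$ together have only $ms + sN \approx 2 \times 10^7$, roughly a $50\times$ reduction.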
Here is the key fact: the SVD gives us an optimal low rank approximation of $X$! That is one of the basic facts about the SVD, and it is why the SVD can be used for image compression.
If the SVD of $X$ is expressed as
\begin{equation}
X = \sum_{i=1}^N \sigma_i u_i v_i^T,
\end{equation}
where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_N$,
then an optimal approximation of $X$ of rank less than or equal to $s$ is
\begin{align}
X &\approx \sum_{i=1}^s \sigma_i u_i v_i^T \\
&= U \Sigma V^T \\
&= U C
\end{align}
where $U$ is the matrix with columns $u_1,\ldots, u_s$, $\Sigma$ is the $s \times s$ diagonal matrix with entries $\sigma_1,\ldots,\sigma_s$, $V$ is the matrix with columns $v_1,\ldots,v_s$, and $C = \Sigma V^T$.
Thus, the SVD finds an optimal $U$ for us.
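To make "optimal" precise: by the Eckart–Young theorem, this truncated sum minimizes the Frobenius-norm error over all matrices of rank at most $s$,
\begin{equation}
\Big\| X - \sum_{i=1}^{s} \sigma_i u_i v_i^T \Big\|_F = \min_{\operatorname{rank}(B) \leq s} \| X - B \|_F = \sqrt{\sigma_{s+1}^2 + \cdots + \sigma_N^2}.
\end{equation}
(The same truncation is also optimal in the spectral norm, where the error is $\sigma_{s+1}$.)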
PCA takes as input vectors $x_1,\ldots,x_N$ as well as a small positive integer $s$. PCA demeans the vectors and stores them in the columns of a matrix $X$, then simply computes the SVD $X = U \Sigma V^T$ and returns the first $s$ columns of $U$ as output.
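Here is a minimal NumPy sketch of that recipe (the function name and the toy data are mine, not from any particular library):

```python
import numpy as np

def pca_via_svd(X, s):
    """Return the first s principal directions of the columns of X,
    plus the coefficient matrix C, so that (demeaned) X ~= U_s @ C.

    X : (m, N) array whose columns are the data vectors x_1, ..., x_N.
    """
    # Demean: subtract the mean vector from every column.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Thin SVD: singular values come back in decreasing order.
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
    # First s left singular vectors = principal directions;
    # C = Sigma V^T holds each column's coefficients in that basis.
    return U[:, :s], sigma[:s, None] * Vt[:s, :]

# Toy check: 1000-dimensional data that actually lies in a 3-dim subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 200))
U_s, C = pca_via_svd(X, s=3)
Xc = X - X.mean(axis=1, keepdims=True)
print(np.linalg.norm(Xc - U_s @ C))  # ~0: rank-3 data is recovered exactly
```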
Best Answer
Correction: $U$ is the matrix of eigenvectors of $AA^T$. If you want to understand why this holds in general, read on; if you are only interested in LSA, skip to the next section.
In SVD, we decompose a matrix $X$ into three matrices, each with useful properties:
$X = USV^T$. Here $U,V$ are unitary matrices: $U^T = U^{-1}$, and the same for $V$. $S$ is a diagonal matrix.
The only property we need is that $U$ and $V$ are unitary matrices.
I'm assuming real matrices everywhere.
Let's start with $XX^T$. Take $X$ to be $m \times n$, so that $U$ is $m \times m$, $S$ is $m \times n$, and $V$ is $n \times n$ (please note the dimensions at each step as indicated):
$XX^T = USV^T (USV^T)^T$
$XX^T = US (V^T V)_{n \times n} S^T U^T$
Here we use $V^TV = I$:
$XX^T = US \, I_{n \times n} \, S^T_{n \times m} \, U^T$
$I_{n \times n} S^T_{n \times m} = S^T$. Notice that $S$ is a diagonal matrix, so $SS^T$ is also diagonal ($m \times m$):
$XX^T = U SS^T U^T$
and, since $U$ is unitary ($U^T = U^{-1}$),
$XX^T = U SS^T U^{-1}$
This is the same expression as the eigendecomposition $XX^T = Q \Lambda Q^{-1}$: the columns of $U$ are eigenvectors of $XX^T$, and the diagonal entries of $SS^T$ (the squared singular values) are the eigenvalues.
You can try a similar chain of reasoning with $X^TX$ to see that the columns of $V$ are eigenvectors of $X^TX$.
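Both claims are easy to check numerically; a minimal NumPy sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))             # m = 5, n = 3

U, sigma, Vt = np.linalg.svd(X)         # full SVD: U is 5x5, Vt is 3x3

# Eigendecompositions of the two Gram matrices.
evals_left, _ = np.linalg.eigh(X @ X.T)
evals_right, _ = np.linalg.eigh(X.T @ X)

# The nonzero eigenvalues of X X^T and X^T X are the squared singular
# values (eigh returns them in increasing order, so reverse and compare).
print(np.allclose(np.sort(evals_right)[::-1], sigma**2))      # True
print(np.allclose(np.sort(evals_left)[::-1][:3], sigma**2))   # True

# Columns of U (resp. V) diagonalize X X^T (resp. X^T X).
print(np.allclose(U.T @ (X @ X.T) @ U, np.diag(np.r_[sigma**2, 0, 0])))
print(np.allclose(Vt @ (X.T @ X) @ Vt.T, np.diag(sigma**2)))
```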
Distance between documents
I think you mean "approximate" instead of "decompose."
SVD does not convert the words into vectors. The second part of your sentence is correct though: we can estimate the distance between documents by using the SVD to approximate the document matrix.
This is my understanding: Say you have a set of $n$ documents. To represent them mathematically, you strip out the common words such as "the", "a", "here", "there". In other words, these are words that you believe can appear in all these documents and have no significance. After this, you count the remaining words in the document and create a column vector, where each word has a score (the number of times it occurs in this document). So far you have represented $1$ document as a column vector.
Now you do this for all $n$ documents, stack the vectors side by side (making sure that each word gets its own row), and you have a matrix. Call this matrix $C$.
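In code, the construction might look like this (a minimal sketch; the documents and stop-word list are made up):

```python
from collections import Counter

import numpy as np

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply",
]
stop_words = {"the", "a", "on", "here", "there"}

# Count the remaining words in each document.
counts = [Counter(w for w in doc.split() if w not in stop_words)
          for doc in documents]

# Each word gets a row, each document a column.
vocabulary = sorted(set().union(*counts))
C = np.array([[c[word] for c in counts] for word in vocabulary])
print(vocabulary)   # row labels
print(C)            # the term-document matrix (words x documents)
```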
At this point what you have is a massive matrix $C$ which is difficult to handle, both for humans and computationally. This is where SVD is useful. It can decompose this matrix into three matrices, each of which has a useful property.
$C = USV^T$
Here $S$ is a diagonal matrix. The diagonal elements (the singular values) will have different values, but all positive in this case.
Now you can think of approximating. Say you notice that most diagonal elements of $S$ are close to zero. You set them to zero. With this thresholding done, you can reconstruct a matrix $C'$ by multiplying the three matrices again.
Mathematically, this $C'$ is an approximation of $C$.
More importantly, after we threshold the diagonal elements of $S$, we'll have several columns of $S$ that are entirely zero. This means the corresponding rows in $V^T$ are actually useless, because they do not contribute to the approximated $C'$.
Let's drop those irrelevant rows from $V^T$ and call this new matrix $V_{truncated}^T$.
$V_{truncated}^T$ contains the same number of columns as $C$ but very few rows. This means we started with $n$ documents and maybe a million rows but now we have $n$ documents and only a handful of rows in $V^T_{truncated}$.
In fact, we could threshold the diagonal elements harshly, leaving only $2$ rows in our $V^T_{truncated}$.
The product $S_{truncated}V^T_{truncated}$ (keeping only the corresponding rows and columns of $S$) is a low dimensional representation of our original documents, one column per document. At this point we could easily draw the low-dimensional documents as 2D points on a piece of paper. Points that are close together represent similar documents, or, as you put it, a small "distance between documents."
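Putting the whole pipeline together, a minimal NumPy sketch (the matrix $C$ here is a random stand-in for a real term-document matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in term-document matrix: 1000 "words" x 6 "documents" whose
# expected counts lie in a 2-dimensional subspace.
C = rng.poisson(lam=rng.uniform(0, 4, size=(1000, 2)) @
                rng.uniform(0, 2, size=(2, 6))).astype(float)

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)

# Keep the 2 largest singular values; zero (i.e. drop) the rest.
s = 2
docs_2d = sigma[:s, None] * Vt[:s, :]   # S_truncated @ V^T_truncated, 2 x n

# Reconstruction from the truncated factors approximates C.
C_prime = U[:, :s] @ docs_2d
print(np.linalg.norm(C - C_prime) / np.linalg.norm(C))  # relative error

# Pairwise document distances in the 2-D representation.
diffs = docs_2d[:, :, None] - docs_2d[:, None, :]
print(np.linalg.norm(diffs, axis=0))    # 6 x 6 distance matrix
```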
Some more material
Some intuitive answers here. After that, this and this explain what SVD-based approximation achieves in this application, i.e. LSI.