You can indeed store the information with less data. A truncated SVD gives the best rank-$k$ approximation of the image. If we take the SVD of the image $I$, then $I = U S V^T$, where the columns of $U$ and $V$ are the left and right singular vectors and $S$ is the diagonal matrix of singular values. We can then form the rank-$k$ approximation
$I \approx \sum\limits_{n=1}^k \sigma_n u_n v_n^T.$
We can then make a good approximation with a fraction of the data. For example, here's a $512 \times 512$ image of Lena.
And here are a few low-rank approximations to the image. You can see that by $k=50$ we are getting something very similar visually, while storing only $50(512+512+1) = 51250$ numbers compared to $512^2 = 262144$, less than a fifth of the data.
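Here is a minimal NumPy sketch of the idea (the image is a stand-in random array; with a real image you would first load it as a 2-D grayscale array):

```python
import numpy as np

def rank_k_approximation(img, k):
    """Best rank-k approximation of a 2-D array via the truncated SVD."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A stand-in 512x512 "image"; storage drops from 512^2 values
# to k*(512 + 512 + 1) values for the rank-k version.
img = np.random.rand(512, 512)
approx = rank_k_approximation(img, 50)
print(np.linalg.norm(img - approx) / np.linalg.norm(img))  # relative error
```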
Suppose we have a bunch of large vectors $x_1,\ldots,x_N$ stored as the columns of a matrix $X$. It would be nice if we could somehow find a small number of vectors $u_1,\ldots,u_s$ such that each vector $x_i$ is (to a good approximation) equal to a linear combination of the vectors $u_1,\ldots, u_s$. This would allow us to describe each of the (very large) vectors $x_i$ using just a small number of coefficients.
So we want to find vectors $u_1,\ldots, u_s$ such that for each $x_i$ we have
\begin{equation}
x_i \approx c_{i,1} u_1 + c_{i,2} u_2 + \cdots + c_{i,s} u_s
\end{equation}
for some coefficients $c_{i,1},\ldots, c_{i,s}$.
These $N$ equations ($i$ goes from $1$ to $N$) can be combined into one single matrix equation:
\begin{equation}
X \approx U C
\end{equation}
for some matrix $C$. (Here the columns of $U$ are $u_1,\ldots, u_s$.)
Note that the rank of $UC$ is less than or equal to $s$. So $UC$ is a low rank approximation of $X$.
Here is the key fact: the SVD gives us an optimal low-rank approximation of $X$! That is one of the basic facts about the SVD, and it is why the SVD can be used for image compression.
If the SVD of $X$ is expressed as
\begin{equation}
X = \sum_{i=1}^N \sigma_i u_i v_i^T,
\end{equation}
where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_N$,
then an optimal approximation of $X$ of rank less than or equal to $s$ is
\begin{align}
X &\approx \sum_{i=1}^s \sigma_i u_i v_i^T \\
&= U \Sigma V^T \\
&= U C
\end{align}
where $U$ is the matrix with columns $u_1,\ldots, u_s$, $\Sigma$ is the $s \times s$ diagonal matrix with entries $\sigma_1,\ldots,\sigma_s$, $V$ is the matrix with columns $v_1,\ldots,v_s$, and $C = \Sigma V^T$.
Thus, the SVD finds an optimal $U$ for us.
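A short NumPy sketch of this construction (the dimensions and the random matrix are illustrative):

```python
import numpy as np

d, N, s = 1000, 200, 10            # vector length, number of vectors, target rank
X = np.random.rand(d, N)           # columns are the vectors x_1, ..., x_N

U_full, sigma, Vt = np.linalg.svd(X, full_matrices=False)
U = U_full[:, :s]                  # u_1, ..., u_s as columns
C = np.diag(sigma[:s]) @ Vt[:s]    # C = Sigma V^T; column i holds the coefficients for x_i

# U @ C is an optimal approximation of X among matrices of rank at most s.
err = np.linalg.norm(X - U @ C)               # Frobenius norm of the error
print(err, np.sqrt(np.sum(sigma[s:]**2)))     # equals the energy in the discarded singular values
```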
PCA takes as input vectors $x_1,\ldots,x_N$ together with a small positive integer $s$. It demeans the vectors, stores them in the columns of a matrix $X$, computes the SVD $X = U \Sigma V^T$, and returns the first $s$ columns of $U$ as output.
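A sketch of that procedure in NumPy (the data here are random and the function name is just illustrative):

```python
import numpy as np

def pca(x_vectors, s):
    """Return the first s principal directions of a collection of vectors."""
    X = np.column_stack(x_vectors).astype(float)   # one column per vector
    X -= X.mean(axis=1, keepdims=True)             # demean each coordinate
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :s]

data = [np.random.rand(50) for _ in range(200)]    # 200 vectors of length 50
components = pca(data, s=3)
print(components.shape)                            # (50, 3)
```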
Best Answer
OK, I'll give it a shot. Let me start from the top, recap PCA, and then show the connection to the SVD.
Recall: for PCA, we begin with a centered data matrix $M$ of dimensions $n \times d$, where $n$ is the number of data points. From this data we compute the sample covariance $S = \frac{1}{n-1}M^TM$.
For this covariance matrix we find the eigenvectors and eigenvalues, and select the $l$ eigenvectors corresponding to the largest eigenvalues. Let's call the matrix whose columns are these eigenvectors $W$.
$W$ will have dimensions $d \times l$. Then we can write $Z = MW$, and each row of $Z$ is a lower-dimensional embedding of the corresponding row $m_i$ of $M$.
OK, now suppose we write the SVD $M = U S V^T$. Then we notice that
\begin{align}
M^TM &= VS^TU^TUSV^T \\
&= V(S^TS)V^T && \text{(since $U$ has orthonormal columns)} \\
&= VDV^T,
\end{align}
where $D = S^TS$ is a diagonal matrix containing the squares of the singular values.
Thus $(M^TM)V = VD$, since $V$ is also orthogonal. Aha! So we can see that the columns of $V$ are the eigenvectors of $M^TM$ and the diagonal of $D$ contains the corresponding eigenvalues. This $V$ (or rather its first $l$ columns) is precisely what we called $W$ above.
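A quick numerical check of this correspondence (a sketch with random centered data; the eigenvectors and right singular vectors agree up to sign):

```python
import numpy as np

n, d, l = 500, 6, 2
M = np.random.rand(n, d)
M -= M.mean(axis=0)                          # center the data

# PCA route: eigenvectors of the sample covariance
S_cov = M.T @ M / (n - 1)
eigvals, eigvecs = np.linalg.eigh(S_cov)     # eigh returns ascending eigenvalues
W = eigvecs[:, ::-1][:, :l]                  # top-l eigenvectors

# SVD route: right singular vectors of M
_, sing_vals, Vt = np.linalg.svd(M, full_matrices=False)
V_top = Vt[:l].T

print(np.allclose(np.abs(W), np.abs(V_top)))                       # same directions up to sign
print(np.allclose(eigvals[::-1][:l], sing_vals[:l]**2 / (n - 1)))  # eigenvalues = sigma^2 / (n-1)
```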
Hope that clears things up a bit. Sorry for the delay :)