Some comments first. There are several serious confusions in what you write. For example, in the third paragraph, having seen that the entries of $AB$ are obtained by taking the dot product of the corresponding row of $A$ with the corresponding column of $B$, you write that you view $AB$ as a dot product of rows of $B$ and rows of $A$. It's not.
For another example, you talk about matrix multiplication "happening". Matrices aren't running wild in the hidden jungles of the Amazon, where things "happen" without human beings. Matrix multiplication is defined a certain way, and then the definition is why matrix multiplication is done the way it is done. You may very well ask why matrix multiplication is defined the way it is defined, and whether there are other ways of defining a "multiplication" on matrices (yes, there are; read further), but that's a completely separate question. "Why does matrix multiplication happen the way it does?" is pretty incoherent on its face.
Another example of confusion is that not every matrix corresponds to a "change in reference system". This is only true, viewed from the correct angle, for invertible matrices.
Standard matrix multiplication. Matrix multiplication is defined the way it is because it corresponds to composition of linear transformations. Though this holds in much greater generality, let's focus on linear transformations $T\colon \mathbb{R}^n\to\mathbb{R}^m$. Since linear transformations satisfy $T(\alpha\mathbf{x}+\beta\mathbf{y}) = \alpha T(\mathbf{x})+\beta T(\mathbf{y})$, if you know the value of $T$ at each of $\mathbf{e}_1,\ldots,\mathbf{e}_n$, where $\mathbf{e}_i$ is the (column) $n$-vector that has $0$s in each coordinate except the $i$th coordinate, where it has a $1$, then you know the value of $T$ at every single vector of $\mathbb{R}^n$.
So in order to describe $T$, I just need to tell you what each $T(\mathbf{e}_i)$ is. For example, we can take
$$T(\mathbf{e}_i) = \left(\begin{array}{c}a_{1i}\\a_{2i}\\ \vdots\\ a_{mi}\end{array}\right).$$
Then, since
$$\left(\begin{array}{c}k_1\\k_2\\ \vdots\\k_n\end{array}\right) = k_1\mathbf{e}_1 + \cdots +k_n\mathbf{e}_n,$$ we have
$$T\left(\begin{array}{c}k_1\\k_2\\ \vdots\\ k_n\end{array}\right) = k_1T(\mathbf{e}_1) + \cdots +k_nT(\mathbf{e}_n) = k_1\left(\begin{array}{c}a_{11}\\a_{21}\\ \vdots\\a_{m1}\end{array}\right) + \cdots + k_n\left(\begin{array}{c}a_{1n}\\a_{2n}\\ \vdots\\ a_{mn}\end{array}\right).$$
It is very fruitful, then, to keep track of the $a_{ij}$ in some way, and given the expression above, we keep track of them in a matrix, which is just a rectangular array of real numbers. We then think of $T$ as being "given" by the matrix
$$\left(\begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{array}\right).$$
If we keep track of $T$ this way, then for an arbitrary vector $\mathbf{x} = (x_1,\ldots,x_n)^t$ (the ${}^t$ means "transpose": turn every row into a column and every column into a row), we have that $T(\mathbf{x})$ corresponds to:
$$\left(\begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{array}\right) \left(\begin{array}{c}
x_1\\x_2\\ \vdots\\ x_n\end{array}\right) = \left(\begin{array}{c}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n\\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n\\
\vdots\\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n
\end{array}\right).$$
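As a quick check of the formula above, here is a minimal sketch in plain Python (lists only, no libraries, all names mine): the matrix-vector product computed by the row formula agrees with the linear combination $x_1T(\mathbf{e}_1)+\cdots+x_nT(\mathbf{e}_n)$ of the columns.

```python
# Two ways of applying the matrix of T to a vector x; both are sketches.

def mat_vec(A, x):
    """Row-by-row formula: entry i is a_i1*x_1 + ... + a_in*x_n."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def column_combination(A, x):
    """Same vector, seen as x_1*(column 1) + ... + x_n*(column n)."""
    m, n = len(A), len(A[0])
    return [sum(x[j] * A[i][j] for j in range(n)) for i in range(m)]

A = [[1, 2, 3],
     [4, 5, 6]]          # a 2x3 matrix: T maps R^3 to R^2
x = [1, 0, -1]

assert mat_vec(A, x) == column_combination(A, x) == [-2, -2]
```

The two functions are the same computation read in two orders, which is exactly the point of the displayed equation.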
What happens when we have two linear transformations, $T\colon \mathbb{R}^n\to\mathbb{R}^m$ and $S\colon\mathbb{R}^p\to\mathbb{R}^n$? If $T$ corresponds as above to a certain $m\times n$ matrix, then $S$ will likewise correspond to a certain $n\times p$ matrix, say
$$\left(\begin{array}{cccc}
b_{11} & b_{12} & \cdots & b_{1p}\\
b_{21} & b_{22} & \cdots & b_{2p}\\
\vdots & \vdots & \ddots & \vdots\\
b_{n1} & b_{n2} & \cdots & b_{np}
\end{array}\right).$$
What is $T\circ S$? First, it is a linear transformation because composition of linear transformations yields a linear transformation. Second, it goes from $\mathbb{R}^p$ to $\mathbb{R}^m$, so it should correspond to an $m\times p$ matrix. Which matrix? If we let $\mathbf{f}_1,\ldots,\mathbf{f}_p$ be the (column) $p$-vectors given by letting $\mathbf{f}_j$ have $0$s everywhere and a $1$ in the $j$th entry, then the matrix above tells us that
$$S(\mathbf{f}_j) = \left(\begin{array}{c}b_{1j}\\b_{2j}\\ \vdots \\b_{nj}\end{array}\right) = b_{1j}\mathbf{e}_1+\cdots + b_{nj}\mathbf{e}_n.$$
So, what is $T\circ S(\mathbf{f}_j)$? This is what goes in the $j$th column of the matrix that corresponds to $T\circ S$. Evaluating, we have:
\begin{align*}
T\circ S(\mathbf{f}_j) &= T\Bigl( S(\mathbf{f}_j)\Bigr)\\
&= T\Bigl( b_{1j}\mathbf{e}_1 + \cdots + b_{nj}\mathbf{e}_n\Bigr)\\
&= b_{1j} T(\mathbf{e}_1) + \cdots + b_{nj}T(\mathbf{e}_n)\\
&= b_{1j}\left(\begin{array}{c}
a_{11}\\ a_{21}\\ \vdots\\ a_{m1}\end{array}\right) + \cdots + b_{nj}\left(\begin{array}{c} a_{1n}\\a_{2n}\\ \vdots\\ a_{mn}\end{array}\right)\\
&= \left(\begin{array}{c}
a_{11}b_{1j} + a_{12}b_{2j} + \cdots + a_{1n}b_{nj}\\
a_{21}b_{1j} + a_{22}b_{2j} + \cdots + a_{2n}b_{nj}\\
\vdots\\
a_{m1}b_{1j} + a_{m2}b_{2j} + \cdots + a_{mn}b_{nj}
\end{array}\right).
\end{align*}
So if we want to write down the matrix that corresponds to $T\circ S$, then the $(i,j)$th entry will be
$$a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{in}b_{nj}.$$
So we define the "composition" or product of the matrix of $T$ with the matrix of $S$ to be precisely the matrix of $T\circ S$. We can make this definition without reference to the linear transformations that gave it birth: if the matrix of $T$ is $m\times n$ with entries $a_{ij}$ (let's call it $A$); and the matrix of $S$ is $n\times p$ with entries $b_{rs}$ (let's call it $B$), then the matrix of $T\circ S$ (let's call it $A\circ B$ or $AB$) is $m\times p$ and with entries $c_{k\ell}$, where
$$c_{k\ell} = a_{k1}b_{1\ell} + a_{k2}b_{2\ell} + \cdots + a_{kn}b_{n\ell}$$
by definition. Why? Because then the matrix of the composition of two functions is precisely the product of the matrices of the two functions. We can work with the matrices directly without having to think about the functions.
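This claim can be checked directly in a small sketch (plain Python lists, no libraries, all names mine): multiplying the matrices with the $c_{k\ell}$ formula above, and composing the two maps, give the same answer on every vector.

```python
def mat_mul(A, B):
    """(i,j) entry is a_i1*b_1j + ... + a_in*b_nj; A is m x n, B is n x p."""
    n, p = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]

def mat_vec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

A = [[1, 2], [3, 4], [5, 6]]   # matrix of T: R^2 -> R^3
B = [[0, 1, 2], [1, 0, -1]]    # matrix of S: R^3 -> R^2

x = [1, 2, 3]
# Applying T after S agrees with applying the product matrix AB:
assert mat_vec(A, mat_vec(B, x)) == mat_vec(mat_mul(A, B), x)
```

The product matrix really is just the composition, precomputed once and for all.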
In point of fact, there is nothing about the dot product which is at play in this definition. It is essentially by happenstance that the $(i,j)$ entry can be obtained as a dot product of something. In fact, the $(i,j)$th entry is obtained as the matrix product of the $1\times n$ matrix consisting of the $i$th row of $A$, with the $n\times 1$ matrix consisting of the $j$th column of $B$. Only if you transpose this column can you try to interpret this as a dot product. (In fact, the modern view is the other way around: we define the dot product of two vectors as a special case of a more general inner product, called the Frobenius inner product, which is defined in terms of matrix multiplication, $\langle\mathbf{x},\mathbf{y}\rangle =\mathrm{trace}(\overline{\mathbf{y}^t}\mathbf{x})$).
And because the product of matrices corresponds to composition of linear transformations, all the nice properties of composition of linear functions automatically hold for products of matrices as well: the matrix product is nothing more than a book-keeping device for keeping track of the composition of linear transformations. So $(AB)C = A(BC)$, because composition of functions is associative. $A(B+C) = AB + AC$, because composition of linear transformations distributes over sums of linear transformations (sums of matrices are defined entry-by-entry precisely so that they agree with sums of linear transformations). And $A(\alpha B) = \alpha(AB) = (\alpha A)B$, because composition of linear transformations behaves that way with scalar multiplication (products of matrices by scalars are defined the way they are precisely so that they correspond to the operation on linear transformations).
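A quick numerical sanity check of these inherited properties, sketched with small random integer matrices and a plain-Python product (helper names are mine):

```python
import random

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def rand(m, n):
    return [[random.randint(-5, 5) for _ in range(n)] for _ in range(m)]

A, B, C = rand(2, 3), rand(3, 4), rand(4, 2)
D = rand(3, 4)

assert mat_mul(mat_mul(A, B), C) == mat_mul(A, mat_mul(B, C))             # (AB)C = A(BC)
assert mat_mul(A, mat_add(B, D)) == mat_add(mat_mul(A, B), mat_mul(A, D)) # A(B+D) = AB+AD
```

Integer entries make the comparisons exact, so the asserts really do test the identities and not floating-point luck.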
So we define product of matrices explicitly so that it will match up composition of linear transformations. There really is no deeper hidden reason. It seems a bit incongruous, perhaps, that such a simple reason results in such a complicated formula, but such is life.
Another reason why it is somewhat misguided to try to understand the matrix product in terms of the dot product is that the matrix product keeps track of all the information about the two transformations being composed, while the dot product loses a lot of information about the two vectors in question. Knowing that $\mathbf{x}\cdot\mathbf{y}=0$ only tells you that $\mathbf{x}$ and $\mathbf{y}$ are perpendicular; it doesn't really tell you anything else. There is a lot of informational loss in the dot product, and trying to explain the matrix product in terms of the dot product requires that we "recover" all of this lost information in some way. In practice, that means keeping track of all the original information anyway, which makes shoehorning the dot product into the explanation unnecessary: you already have all the information needed to get the product directly.
Examples that are not just "changes in reference system". Note that any linear transformation corresponds to a matrix. But the only linear transformations that can be thought of as "changes in perspective" are the linear transformations that map $\mathbb{R}^n$ to itself, and which are one-to-one and onto. There are lots of linear transformations that aren't like that. For example, the linear transformation $T$ from $\mathbb{R}^3$ to $\mathbb{R}^2$ defined by
$$T\left(\begin{array}{c}
a\\b\\c\end{array}\right) = \left(\begin{array}{c}b\\2c\end{array}\right)$$
is not a "change in reference system" (because lots of nonzero vectors go to zero, but there is no way to just "change your perspective" and start seeing a nonzero vector as zero) but is a linear transformation nonetheless. The corresponding matrix is $2\times 3$, and is
$$\left(\begin{array}{ccc}
0 & 1 & 0\\
0 & 0 & 2
\end{array}\right).$$
Now consider the linear transformation $U\colon\mathbb{R}^2\to\mathbb{R}^2$ given by
$$U\left(\begin{array}{c}x\\y\end{array}\right) = \left(\begin{array}{c}3x+2y\\
9x + 6y\end{array}\right).$$
Again, this is not a "change in perspective", because the vector $\binom{2}{-3}$ is mapped to $\binom{0}{0}$. It has a matrix, $2\times 2$, which is
$$\left(\begin{array}{cc}
3 & 2\\
9 & 6
\end{array}\right).$$
So the composition $U\circ T$ has matrix:
$$\left(\begin{array}{cc}
3 & 2\\
9 & 6
\end{array}\right) \left(\begin{array}{ccc}
0 & 1 & 0\\
0 & 0 & 2
\end{array}\right) = \left(\begin{array}{ccc}
0 & 3 & 4\\
0 & 9 & 12
\end{array}\right),$$
which tells me that
$$U\circ T\left(\begin{array}{c}x\\y\\z\end{array}\right) = \left(\begin{array}{c} 3y + 4z\\ 9y+12z\end{array}\right).$$
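The worked example can be checked directly in a small sketch (plain Python lists, no libraries): apply $T$, then $U$, and compare against the product matrix computed above.

```python
def mat_vec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

T = [[0, 1, 0],
     [0, 0, 2]]    # T(a,b,c) = (b, 2c)
U = [[3, 2],
     [9, 6]]       # U(x,y) = (3x+2y, 9x+6y)
UT = [[0, 3, 4],
      [0, 9, 12]]  # the product computed above

# Composition and product agree on several test vectors:
for v in ([1, 2, 3], [0, -1, 5], [7, 0, 0]):
    assert mat_vec(U, mat_vec(T, v)) == mat_vec(UT, v)
```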
Other matrix products. Are there other ways to define the product of two matrices? Sure. There's the Hadamard product, which is the "obvious" thing to try: you can multiply two matrices of the same size (and only of the same size), and you do it entry by entry, just the same way that you add two matrices. This has some nice properties, but it has nothing to do with linear transformations. There's the Kronecker product, which takes an $m\times n$ matrix times a $p\times q$ matrix and gives an $mp\times nq$ matrix. This one is associated to the tensor product of linear transformations. They are defined differently because they are meant to model other operations that one does with matrices or vectors.
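Both alternative products are short enough to sketch in plain Python (lists only; function names are mine):

```python
def hadamard(A, B):
    """Entrywise product; A and B must have the same shape."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def kronecker(A, B):
    """m x n times p x q gives mp x nq: each a_ij becomes the block a_ij * B."""
    p, q = len(B), len(B[0])
    return [[A[i // p][j // q] * B[i % p][j % q]
             for j in range(len(A[0]) * q)] for i in range(len(A) * p)]

A = [[1, 2], [3, 4]]
B = [[0, 5], [6, 7]]

assert hadamard(A, B) == [[0, 10], [18, 28]]
assert kronecker(A, B) == [[0, 5, 0, 10],
                           [6, 7, 12, 14],
                           [0, 15, 0, 20],
                           [18, 21, 24, 28]]
```

Note how the Kronecker product's block structure mirrors the tensor product: the $(i,j)$ block of the result is $a_{ij}B$.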
The topic of low rank approximation is sprinkled throughout Math SE:
Low-rank Approximation with SVD on a Kernel Matrix
Matrix values increasing after SVD, singular value decomposition
The singular value spectrum may span several orders of magnitude. It seems natural that the contributions from the larger values are more important. Numerically, it is difficult to tell whether small singular values are valid or simply machine noise in computing a $0$ singular value. This requires a threshold to determine which singular values are discarded.
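A sketch of such a threshold in practice, assuming NumPy is available (the relative-tolerance formula below mirrors the convention NumPy itself uses for numerical rank): build a matrix of exact rank $2$, then count the singular values that survive the threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
# An exactly rank-2 matrix, 5 x 4, built as a product of thin factors:
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))

s = np.linalg.svd(A, compute_uv=False)           # singular values, descending
tol = max(A.shape) * np.finfo(A.dtype).eps * s[0]  # relative threshold
rank = int(np.sum(s > tol))

# The tiny trailing singular values are discarded as machine noise:
assert rank == 2
```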
Let's look at the SVD in detail.
Singular Value Decomposition
Every matrix
$$
\mathbf{A} \in \mathbb{C}^{m\times n}_{\rho}
$$
has a singular value decomposition of the form
$$
\begin{align}
\mathbf{A} &=
\mathbf{U} \, \Sigma \, \mathbf{V}^{*} \\
%
&=
% U
\left[ \begin{array}{cc}
\color{blue}{\mathbf{U}_{\mathcal{R}}} & \color{red}{\mathbf{U}_{\mathcal{N}}}
\end{array} \right]
% Sigma
\left[ \begin{array}{cccc|ccc}
\sigma_{1} & 0 & \dots & & & \dots & 0 \\
0 & \sigma_{2} \\
\vdots && \ddots \\
& & & \sigma_{\rho} \\\hline
& & & & 0 & \\
\vdots &&&&&\ddots \\
0 & & & & & & 0 \\
\end{array} \right]
% V
\left[ \begin{array}{c}
\color{blue}{\mathbf{V}_{\mathcal{R}}}^{*} \\
\color{red}{\mathbf{V}_{\mathcal{N}}}^{*}
\end{array} \right] \\
%
& =
% U
\left[ \begin{array}{cccccccc}
\color{blue}{u_{1}} & \dots & \color{blue}{u_{\rho}} & \color{red}{u_{\rho+1}} & \dots & \color{red}{u_{m}}
\end{array} \right]
% Sigma
\left[ \begin{array}{cc}
\mathbf{S}_{\rho\times \rho} & \mathbf{0} \\
\mathbf{0} & \mathbf{0}
\end{array} \right]
% V
\left[ \begin{array}{c}
\color{blue}{v_{1}^{*}} \\
\vdots \\
\color{blue}{v_{\rho}^{*}} \\
\color{red}{v_{\rho+1}^{*}} \\
\vdots \\
\color{red}{v_{n}^{*}}
\end{array} \right]
%
\end{align}
$$
The connection to the row and column spaces follows:
$$
\begin{align}
% R A
\color{blue}{\mathcal{R} \left( \mathbf{A} \right)} &=
\text{span} \left\{
\color{blue}{u_{1}}, \dots , \color{blue}{u_{\rho}}
\right\} \\
% R A*
\color{blue}{\mathcal{R} \left( \mathbf{A}^{*} \right)} &=
\text{span} \left\{
\color{blue}{v_{1}}, \dots , \color{blue}{v_{\rho}}
\right\} \\
% N A*
\color{red}{\mathcal{N} \left( \mathbf{A}^{*} \right)} &=
\text{span} \left\{
\color{red}{u_{\rho+1}}, \dots , \color{red}{u_{m}}
\right\} \\
% N A
\color{red}{\mathcal{N} \left( \mathbf{A} \right)} &=
\text{span} \left\{
\color{red}{v_{\rho+1}}, \dots , \color{red}{v_{n}}
\right\} \\
%
\end{align}
$$
What you are using is $\mathbf{S} \, \color{blue}{\mathbf{V}_{\mathcal{R}}}^{*}$. This ignores the null space contributions in red.
A rank $\rho = 3$ approximation would look like this:
$$
\mathbf{S}_{3} \, \color{blue}{\mathbf{V}_{\mathcal{R}}}^{*} =
\left[ \begin{array}{ccc}
\sigma_{1} & 0 & 0 \\
0 & \sigma_{2} & 0 \\
0 & 0 & \sigma_{3} \\
\end{array} \right]
%
% V
\left[ \begin{array}{c}
\color{blue}{v_{1}^{*}} \\
\color{blue}{v_{2}^{*}} \\
\color{blue}{v_{3}^{*}} \\
\end{array} \right]
%
\in \mathbb{C}^{\rho \times n}
$$
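The rank-$\rho$ truncation can be sketched in NumPy (assuming NumPy is available; the helper name is mine): keep the first $\rho$ columns of $\mathbf{U}$, the first $\rho$ singular values, and the first $\rho$ rows of $\mathbf{V}^{*}$, discarding the red null-space blocks.

```python
import numpy as np

def low_rank(A, rho):
    """Truncated SVD: the best rank-rho approximation in the 2-norm."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    return U[:, :rho] @ np.diag(s[:rho]) @ Vh[:rho, :]

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 5))

A3 = low_rank(A, 3)
assert np.linalg.matrix_rank(A3) == 3
# Truncating at the full rank recovers A exactly (up to round-off):
assert np.allclose(low_rank(A, 5), A)
```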
A sequence of Koch snowflake fractals and their singular value spectra (figures not reproduced here) illustrates the point: as the object becomes more detailed, the spectrum becomes richer.
A matrix represents a linear transformation that rotates, scales and shears whatever you put into it; so feeding the coordinates of a square could potentially give you, say, a parallelogram. An important fact is that there is a one-to-one correspondence between all real matrices and linear transformations: if you can think of a linear transformation, then there is a way to write it as a matrix.
SVD is based on a theorem that says any matrix $\mathbf A$ can be written in the form $\mathbf{U\Sigma V}^T$, where $\mathbf U$ and $\mathbf V$ are rotations (orthogonal matrices, possibly including reflections) and $\mathbf \Sigma$ is a matrix that scales. So, any linear transformation can be broken down into 3 steps: rotate first, stretch/scale (not necessarily by the same amount in all directions; you could stretch the x-axis twice as much as the y-axis), and rotate again.
For instance, to transform a square into a parallelogram, you could rotate clockwise by $\theta$ (the exact value is not too important as long as you pick a sensible number, since the rotations in such a decomposition are not unique), scale the axes by different factors, then rotate counter-clockwise again by $\theta$.
Points 1 and 2 are related in the following way: a projection (point 2) is a 'simplified' transformation. Suppose you had a transformation that changes a 1x1 square into a 10x0.1 rectangle. A projection would be to simply say that this transformation changes the square into a 10x0 'rectangle' (which is a line). This is dimensionality reduction: your 2-dimension square is projected onto a 1-dimensional line. If you did an SVD with this, $\mathbf U$ and $\mathbf V$ would be the identity matrices, and $\mathbf \Sigma$ would be a diagonal matrix (as it always is) with entries 10 and 0.1.
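The $10\times 0.1$ example above, numerically (assuming NumPy is available): the map is $\mathrm{diag}(10, 0.1)$, and replacing the small singular value with zero gives the rank-1 "projection" $\mathrm{diag}(10, 0)$.

```python
import numpy as np

A = np.diag([10.0, 0.1])
U, s, Vh = np.linalg.svd(A)

assert np.allclose(s, [10.0, 0.1])   # Sigma scales by 10 and 0.1
# Zero out the small singular value: the best rank-1 approximation,
# which collapses the square onto a line (a projection).
A1 = U @ np.diag([s[0], 0.0]) @ Vh
assert np.allclose(A1, np.diag([10.0, 0.0]))
```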
The key point for understanding the dimensionality reduction part is to forget about the rotations entirely: by the SVD theorem, the rotations can be 'added in' earlier or later; you only want to know the way things scale (along different axes), so the SVD helps you strip away the rotation. A matrix that turns a square into a parallelogram can be seen as something that scales a square into a rectangle (between two rotations). Having something scale to a small value (relative to everything else) means that you can pretend it scales to zero, which, in the context of transformations, is a projection that approximates the original transformation.
To summarise the answer to your question: when your transformation is just scaling, and one of the scales is relatively small, you can replace the smallest scale factor with zero, and this gives you a projection. SVD tells you that all transformations can be expressed as a scaling between two rotations, and the idea of dimensionality reduction is to replace the scaling with a projection. 'Selecting the right axes' refers to the rotation: you want to 'project away' only once you are sure that you lose as little as possible by first rotating your shape (or data).