[Math] Moore-Penrose pseudoinverse and the Euclidean norm

linear algebra, machine learning, matrices, normed-spaces, pseudoinverse

Section 2.9, The Moore-Penrose Pseudoinverse, of the textbook Deep Learning by Goodfellow, Bengio, and Courville says the following:

Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse $\mathbf{B}$ of a matrix $\mathbf{A}$ so that we can solve a linear equation

$$\mathbf{A} \mathbf{x} = \mathbf{y} \tag{2.44}$$

by left-multiplying each side to obtain

$$\mathbf{x} = \mathbf{B} \mathbf{y}. \tag{2.45}$$

Depending on the structure of the problem, it may not be possible to design a unique mapping from $\mathbf{A}$ to $\mathbf{B}$.

If $\mathbf{A}$ is taller than it is wide, then it is possible for this equation to have no solution. If $\mathbf{A}$ is wider than it is tall, then there could be multiple possible solutions. The Moore-Penrose pseudoinverse enables us to make some headway in these cases. The pseudoinverse of $\mathbf{A}$ is defined as a matrix

$$\mathbf{A}^+ = \lim_{\alpha \searrow 0}(\mathbf{A}^T \mathbf{A} + \alpha \mathbf{I} )^{-1} \mathbf{A}^T. \tag{2.46}$$

Practical algorithms for computing the pseudoinverse are based not on this definition, but rather on the formula

$$\mathbf{A}^+ = \mathbf{V} \mathbf{D}^+ \mathbf{U}^T, \tag{2.47}$$

where $\mathbf{U}$, $\mathbf{D}$ and $\mathbf{V}$ are the singular value decomposition of $\mathbf{A}$, and the pseudoinverse $\mathbf{D}^+$ of a diagonal matrix $\mathbf{D}$ is obtained by taking the reciprocal of its nonzero elements then taking the transpose of the resulting matrix.

When $\mathbf{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions.

When $\mathbf{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} - \mathbf{y} \vert \vert_2$.
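For reference, here is a small NumPy sketch (mine, not the book's) that builds $\mathbf{A}^+$ from the SVD exactly as in equation (2.47) and checks it against NumPy's built-in pseudoinverse; the tall random matrix is just an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                  # a tall matrix, just as an example

U, s, Vt = np.linalg.svd(A)                      # A = U @ D @ Vt, with D of shape (5, 3)
D_plus = np.zeros((A.shape[1], A.shape[0]))      # D^+ has the transposed shape (3, 5)
D_plus[:len(s), :len(s)] = np.diag(1.0 / s)      # reciprocals of the singular values
                                                 # (all nonzero here, since A has full rank)

A_plus = Vt.T @ D_plus @ U.T                     # equation (2.47): A^+ = V D^+ U^T
print(np.allclose(A_plus, np.linalg.pinv(A)))    # True
```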

It's this last part that I'm wondering about:

When $\mathbf{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions.

When $\mathbf{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} - \mathbf{y} \vert \vert_2$.

What I found confusing here is that the Euclidean norms $\vert \vert \mathbf{x} \vert \vert_2$ and $\vert \vert \mathbf{A} \mathbf{x} - \mathbf{y} \vert \vert_2$ seemingly come out of nowhere. Earlier in the section, there is no discussion of the Euclidean norm, only of the mechanics of the Moore-Penrose pseudoinverse, and the authors then simply assert this part without explanation.

So I am left wondering the following:

  1. Why is it that, when $\mathbf{A}$ has more columns than rows, using the pseudoinverse gives us the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions?

  2. Why is it that, when $\mathbf{A}$ has more rows than columns, using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} - \mathbf{y} \vert \vert_2$?

And what are the mechanics involved here?

I would greatly appreciate it if people would please take the time to clarify this.

Best Answer

Eqn. (2.46) proposes to look at the minimizer $x_\alpha$ of the functional $$J_\alpha(x) := \|A x - y\|^2 + \alpha \|x\|^2.$$ For any $\alpha > 0$, this functional is strictly convex and therefore has a unique minimizer $x_\alpha$; moreover, among all $x$ that produce the same residual $\|A x - y\|$, $x_\alpha$ is the one of smallest norm. To find the minimizer, write the norms $\|\cdot\|^2$ in terms of the scalar product $\langle \cdot, \cdot \rangle$ and set the gradient to zero: $2 A^\top (A x - y) + 2 \alpha x = 0$, i.e. $(A^\top A + \alpha I) x = A^\top y$, which gives $x_\alpha = (A^\top A + \alpha I)^{-1} A^\top y$, exactly the matrix in (2.46) applied to $y$.
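As a quick numerical sanity check (a sketch with an arbitrary random $A$ and $y$), the minimizer $x_\alpha = (A^\top A + \alpha I)^{-1} A^\top y$ indeed approaches $A^+ y$ as $\alpha \searrow 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
y = rng.standard_normal(6)

x_pinv = np.linalg.pinv(A) @ y
for alpha in (1.0, 1e-2, 1e-4, 1e-8):
    # Tikhonov-regularized solution x_alpha = (A^T A + alpha I)^{-1} A^T y
    x_alpha = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)
    print(alpha, np.linalg.norm(x_alpha - x_pinv))   # distance shrinks as alpha -> 0
```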

Ad 1. Suppose $A x = y$ has a solution $x^*$. The set of all solutions is the affine (hence convex) set $x^* + \ker A$, so there is exactly one solution of minimal norm: the orthogonal projection of $0$ onto that set. As $\alpha \searrow 0$, the penalty $\alpha \|x\|^2$ becomes negligible compared to the residual term, so the constraint $A x = y$ is enforced in the limit, while the penalty still selects, among the solutions, the one of smallest norm. Therefore $x_0 := \lim_{\alpha \searrow 0} x_\alpha$ is the minimal-norm solution of $A x = y$.
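A small sketch of this point, with an arbitrary wide random matrix: $A^+ y$ solves $A x = y$ exactly, and any other solution, obtained by adding a null-space vector, has a strictly larger norm:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))            # wide: infinitely many solutions of A x = y
y = rng.standard_normal(3)

x_min = np.linalg.pinv(A) @ y
print(np.allclose(A @ x_min, y))           # True: x_min really solves A x = y

_, _, Vt = np.linalg.svd(A)                # the last rows of Vt span the null space of A
n = Vt[-1]
print(np.allclose(A @ n, 0))               # True: n is a null-space direction

x_other = x_min + 2.0 * n                  # another exact solution
print(np.allclose(A @ x_other, y))         # True
print(np.linalg.norm(x_min) < np.linalg.norm(x_other))   # True: x_min has minimal norm
```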

Ad 2. If $A x = y$ has no solution, the residual $\|A x - y\|$ still attains a minimum (this is exactly the least-squares problem), and it is this minimum that is selected in the limit $\alpha \searrow 0$; if the least-squares minimizer is not unique, the limit again picks the one of smallest norm.
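And a sketch of this point, with an arbitrary tall random matrix: $A^+ y$ coincides with the least-squares solution, and any perturbation of it only increases the residual $\|A x - y\|$:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 3))            # tall: generically no exact solution
y = rng.standard_normal(7)

x_pinv = np.linalg.pinv(A) @ y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x_pinv, x_lstsq))        # True: same as the least-squares solution

x_perturbed = x_pinv + 0.1 * rng.standard_normal(3)
print(np.linalg.norm(A @ x_pinv - y) <= np.linalg.norm(A @ x_perturbed - y))   # True
```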
