Questions about norm of $\mathbf{x}$ when minimizing squared norm of $\mathbf{A} \mathbf{x} - \mathbf{b}$

least squares, linear algebra, machine learning, normed-spaces, optimization

Section 4.5 (Example: Linear Least Squares) of the textbook Deep Learning by Goodfellow, Bengio, and Courville says the following:

Suppose we want to find the value of $\mathbf{x}$ that minimizes

$$f(\mathbf{x}) = \dfrac{1}{2}||\mathbf{A} \mathbf{x} - \mathbf{b}||_2^2 \tag{4.21}$$

Specialized linear algebra algorithms can solve this problem efficiently; however, we can also explore how to solve it using gradient-based optimization as a simple example of how these techniques work.

First, we need to obtain the gradient:

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b} \tag{4.22}$$

We can then follow this gradient downhill, taking small steps. See algorithm 4.1 for details.


Algorithm 4.1 An algorithm to minimize $f(\mathbf{x}) = \dfrac{1}{2}||\mathbf{A} \mathbf{x} - \mathbf{b}||_2^2$ with respect to $\mathbf{x}$ using gradient descent, starting from an arbitrary value of $\mathbf{x}$.


Set the step size ($\epsilon$) and tolerance ($\delta$) to small, positive numbers.

while $||\mathbf{A}^T \mathbf{A} \mathbf{x} – \mathbf{A}^T \mathbf{b}||_2 > \delta$ do

$\ \ \ \mathbf{x} \leftarrow \mathbf{x} - \epsilon(\mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b})$

end while
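
For concreteness, here is a minimal NumPy sketch of Algorithm 4.1. This is my own illustrative code, not the book's: the matrix $\mathbf{A}$, vector $\mathbf{b}$, step size, and tolerance are arbitrary choices for demonstration.

```python
import numpy as np

# Illustrative data; any A and b would do for the demonstration.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

x = np.zeros(3)      # arbitrary starting value of x
epsilon = 0.01       # step size
delta = 1e-6         # tolerance

# Repeat x <- x - epsilon * (A^T A x - A^T b) until the gradient is small.
gradient = A.T @ A @ x - A.T @ b
while np.linalg.norm(gradient) > delta:
    x = x - epsilon * gradient
    gradient = A.T @ A @ x - A.T @ b

print(x)
print(np.linalg.lstsq(A, b, rcond=None)[0])   # reference least-squares solution
```

The step size must be small enough (relative to the largest eigenvalue of $\mathbf{A}^T \mathbf{A}$) for the iteration to converge; the two printed vectors should then essentially agree.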


One can also solve this problem using Newton's method. In this case, because the true function is quadratic, the quadratic approximation employed by Newton's method is exact, and the algorithm converges to the global minimum in a single step.
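
To illustrate the one-step claim: the Hessian of $f$ is the constant matrix $\mathbf{A}^T \mathbf{A}$, so a single Newton step from any starting point lands on the minimizer. A small sketch, assuming $\mathbf{A}^T \mathbf{A}$ is invertible (data again illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

x0 = np.zeros(3)                    # arbitrary starting point
gradient = A.T @ A @ x0 - A.T @ b   # gradient of f at x0
hessian = A.T @ A                   # Hessian of f, constant in x

# One Newton step: x0 - H^{-1} grad f(x0); for this quadratic it is exactly
# the minimizer (A^T A)^{-1} A^T b.
x_newton = x0 - np.linalg.solve(hessian, gradient)

print(np.allclose(A.T @ A @ x_newton, A.T @ b))   # normal equations hold: True
```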

Now suppose we wish to minimize the same function, but subject to the constraint $\mathbf{x}^T \mathbf{x} \le 1$. To do so, we introduce the Lagrangian

$$L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda (\mathbf{x}^T \mathbf{x} - 1). \tag{4.23}$$

We can now solve the problem

$$\min_{\mathbf{x}} \max_{\lambda, \lambda \ge 0} L(\mathbf{x}, \lambda). \tag{4.24}$$

The smallest-norm solution to the unconstrained least-squares problem may be found using the Moore-Penrose pseudoinverse: $\mathbf{x} = \mathbf{A}^+ \mathbf{b}$. If this point is feasible, then it is the solution to the constrained problem. Otherwise, we must find a solution where the constraint is active. By differentiating the Lagrangian with respect to $\mathbf{x}$, we obtain the equation
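
A small sketch of that feasibility check, using NumPy's pinv for the Moore-Penrose pseudoinverse (the data are illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

# Minimum-norm solution of the unconstrained least-squares problem.
x_pinv = np.linalg.pinv(A) @ b

# If it already satisfies x^T x <= 1, it also solves the constrained problem;
# otherwise the constraint must be active at the constrained solution.
print(x_pinv @ x_pinv, (x_pinv @ x_pinv) <= 1.0)
```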

$$\mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b} + 2 \lambda \mathbf{x} = 0 \tag{4.25}$$

This tells us that the solution will take the form

$$\mathbf{x} = (\mathbf{A}^T \mathbf{A} + 2 \lambda \mathbf{I})^{-1} \mathbf{A}^T \mathbf{b} \tag{4.26}$$

The magnitude of $\lambda$ must be chosen such that the result obeys the constraint. We can find this value by performing gradient ascent on $\lambda$. To do so, observe

$$\dfrac{\partial}{\partial{\lambda}} L(\mathbf{x}, \lambda) = \mathbf{x}^T \mathbf{x} – 1 \tag{4.27}$$

When the norm of $\mathbf{x}$ exceeds $1$, this derivative is positive, so to follow the derivative uphill and increase the Lagrangian with respect to $\lambda$, we increase $\lambda$. Because the coefficient on the $\mathbf{x}^T \mathbf{x}$ penalty has increased, solving the linear equation for $\mathbf{x}$ will now yield a solution with a smaller norm. The process of solving the linear equation and adjusting $\lambda$ continues until $\mathbf{x}$ has the correct norm and the derivative is $0$.
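
Here is a rough sketch of that process, under my own assumptions: $\mathbf{b}$ is scaled so that the unconstrained solution very likely violates the constraint, and the starting $\lambda$ and ascent step size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = 10.0 * rng.standard_normal(5)   # scaled so the constraint is very likely active

lam = 1.0        # initial Lagrange multiplier (arbitrary)
eta = 0.1        # gradient-ascent step size on lambda (arbitrary)
I = np.eye(3)

for _ in range(5000):
    # Solve the linear equation (4.26) for x at the current lambda.
    x = np.linalg.solve(A.T @ A + 2.0 * lam * I, A.T @ b)
    # Gradient ascent on lambda using dL/dlambda = x^T x - 1 (equation 4.27),
    # keeping lambda non-negative.
    lam = max(0.0, lam + eta * (x @ x - 1.0))

print(x @ x, lam)   # x^T x is approximately 1 once the multiplier has settled
```

Since the norm of the solution of (4.26) decreases monotonically as $\lambda$ grows, one could also find the right $\lambda$ by bisection; the sketch simply follows the gradient-ascent description above.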

My questions here relate to the norm, and are similar to those that I asked here.

  1. At the beginning of this section, the authors reference the norm of $\mathbf{A} \mathbf{x} - \mathbf{b}$. However, at the end of the section, the norm of $\mathbf{x}$, rather than $\mathbf{A} \mathbf{x} - \mathbf{b}$, seemingly comes out of nowhere. Similar to my questions referenced above (in the other thread), where did the norm of $\mathbf{x}$ come from?

  2. My understanding is that the $\mathbf{x}^T \mathbf{x}$ "penalty" that the authors are referencing at the end here is the term $\lambda (\mathbf{x}^T \mathbf{x} – 1)$ in $L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda (\mathbf{x}^T \mathbf{x} – 1)$; would that be correct? If so, then why does the coefficient on the $\mathbf{x}^T \mathbf{x}$ penalty increasing necessitate that solving the linear equation for $\mathbf{x}$ now yields a solution with a smaller norm?

  3. What is meant by "correct" norm in this last part?

I would greatly appreciate it if people would please take the time to clarify these points.

Best Answer

  1. The constraint $x^T x \leq 1$ means that the norm of $x$ is at most $1$, since $x^T x = \|x\|_2^2$. That is where the norm of $x$ comes from: it enters through the constraint, not through the objective.

  2. Yes, you are correct. And if the norm of $x$ is greater than $1$, the $x^T x - 1$ term is positive, so the maximum over $\lambda$ is $+\infty$. Therefore, when you take the minimum over $x$, any minimizer must satisfy the constraint $x^T x \leq 1$, because otherwise the inner maximum is $+\infty$ and cannot be the minimum. As for why a larger coefficient gives a smaller norm: increasing $\lambda$ adds $2\lambda$ to every eigenvalue of the matrix $A^T A + 2\lambda I$ that is inverted in equation (4.26), so its inverse shrinks and the resulting $x$ has a smaller norm (a quick numerical check appears after this list).

  3. The correct norm is the norm that satisfies the constraint, that is, $x^T x = \|x\|^2 \leq 1$, with equality when the constraint is active and the derivative in equation (4.27) is $0$.
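
To see point 2 concretely, here is a small numerical check (illustrative data again) that the norm of the solution of equation (4.26) shrinks as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
I = np.eye(3)

# The norm of x(lambda) = (A^T A + 2*lambda*I)^{-1} A^T b shrinks as lambda
# grows, because 2*lambda is added to every eigenvalue of the inverted matrix.
for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    x = np.linalg.solve(A.T @ A + 2.0 * lam * I, A.T @ b)
    print(lam, np.linalg.norm(x))
```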