Solved – Gradient descent of $f(w)=\frac12w^TAw-b^Tw$ viewed in the space of Eigenvectors of $A$

deep learning, machine learning, optimization

I'm reading Why Momentum Really Works, a post from the new Distill journal. I'll paraphrase the main equations leading to the part which confuses me; the post describes the intuition in more detail.

The gradient descent algorithm is given by the following iterative process
$$w^{k+1} = w^k-\alpha \nabla f(w^k)$$
where $w^k$ is the iterate at step $k$, $\alpha$ is the learning rate, $\nabla f(w^k)$ is the gradient of $f$ evaluated at $w^k$, and $f$ is the function we wish to minimize.

Gradient descent with momentum adds "memory" to the descent; this is described by the pair of equations:

\begin{align}
z^{k+1} &= \beta z^k + \nabla f(w^k) \\
w^{k+1} &= w^k - \alpha z^{k+1}
\end{align}
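For concreteness, here is a minimal NumPy sketch of the two update rules (the helper names like `grad_f` are mine for illustration, not from the post):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, steps=100):
    """Plain gradient descent: w^{k+1} = w^k - alpha * grad f(w^k)."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * grad_f(w)
    return w

def momentum_descent(grad_f, w0, alpha=0.1, beta=0.9, steps=100):
    """Gradient descent with momentum: z is a decaying sum of past gradients."""
    w = np.array(w0, dtype=float)
    z = np.zeros_like(w)
    for _ in range(steps):
        z = beta * z + grad_f(w)   # z^{k+1} = beta * z^k + grad f(w^k)
        w = w - alpha * z          # w^{k+1} = w^k - alpha * z^{k+1}
    return w
```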

In the next section "First Steps: Gradient Descent" the author considers a convex quadratic function
$$f(w) = \frac12w^TAw-b^Tw, \quad w \in \mathbb{R}^n, A \in \mathbb{R}^{n,n}$$
which has gradient $$\nabla f(w) = Aw-b$$
If we assume $A$ is symmetric and invertible, then $f$ has the optimal solution $w^\star = A^{-1}b$.

If we were to use gradient descent then we would iterate towards this optimal solution in the following way
\begin{align}
w^{k+1} &= w^k - \alpha \nabla f(w^k) \\
&= w^k - \alpha (Aw^k - b)
\end{align}
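(A quick sanity check of my own, not from the post: running this iteration on a small symmetric positive definite $A$ does converge to $w^\star = A^{-1}b$ for a small enough step size.)

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # symmetric positive definite example (my choice)
b = np.array([1.0, -1.0])
w_star = np.linalg.solve(A, b)    # optimal solution w* = A^{-1} b

alpha = 0.1                       # small enough: alpha < 2 / lambda_max
w = np.zeros(2)
for k in range(500):
    w = w - alpha * (A @ w - b)   # w^{k+1} = w^k - alpha * (A w^k - b)

print(np.allclose(w, w_star))     # True
```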

Then the article goes on to say "There is a very natural space to view gradient descent where all the dimensions act independently — the eigenvectors of $A$". I think this makes sense, although my intuition is kind of fuzzy.

Every symmetric matrix $A$ has an eigenvalue decomposition where $$A = Q\text{diag}(\lambda_1,\ldots,\lambda_n)Q^T.$$

Here $\lambda_1 \geq \ldots \geq \lambda_n$ and $Q$ is the matrix with the corresponding eigenvectors as columns (right?).
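(As a concrete illustration of my own: `np.linalg.eigh` gives exactly this decomposition for a symmetric matrix, with the eigenvectors as the columns of $Q$, though it sorts the eigenvalues in ascending rather than descending order.)

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                          # symmetric example
lambdas, Q = np.linalg.eigh(A)                      # eigenvalues ascending, eigenvectors as columns

print(np.allclose(A, Q @ np.diag(lambdas) @ Q.T))   # True: A = Q diag(lambda) Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))              # True: Q is orthogonal
```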

This next part is where I don't understand what is going on:

If we perform a change of basis, $x^k = Q^T(w^k - w^\star)$, the
iterations break apart, becoming:

\begin{align}
x_i^{k+1} &= x_i^k - \alpha \lambda_i x_i^k \\
&= (1-\alpha\lambda_i)x_i^k = (1-\alpha\lambda_i)^{k+1}x_i^0
\end{align}

Moving back to our original space $w$, we can see that

$$w^k - w^\star = Qx^k = \sum_{i=1}^n x_i^0(1-\alpha\lambda_i)^k q_i$$

What is going on here? Where is the motivation for taking $w^k - w^\star$ into the eigendomain? What is $x^k$? Why are we now looking at individual elements of the vector? I tried to follow the calculations through, but $x^{k+1}$ depends on $w^{k+1}$, which depends on $z^k$, which I thought we were trying to eliminate. My question is: can someone expand on these few steps with some intuition and calculations? Thanks.

Best Answer

In many mathematical applications, the motivation becomes clearer after deriving the result. So let's start off with the algebra.

Suppose we were to run GD for $T$ iterations. This will give us the set $\{w^k\}_{k=1}^T$.

Let's do a change of basis:

$w^k = Qx^k + w^*$ $\iff$ $x^k = Q^T(w^k-w^*) $

Now we have $\{x^k\}_{k=1}^T$. What can we say about them? Let's look at each coordinate separately. By substituting the above and using the update step of GD,

$x_i^{k+1}= (Q^T(w^{k+1}-w^*))_i = (Q^T(w^k-\alpha (Aw^k-b)-w^*))_i $

Arranging,

$x_i^{k+1}=(Q^T(w^k-w^*))_i-\alpha \cdot (Q^T(Aw^k-b))_i$

The first term is exactly $x_i^k$. For the second term, note that $b = Aw^*$, so $Aw^k - b = A(w^k - w^*)$; substituting $A = Q\,\text{diag}(\lambda_1,\ldots,\lambda_n)\,Q^T$ and using $Q^TQ = I$ gives $(Q^TA(w^k-w^*))_i = \lambda_i\,(Q^T(w^k-w^*))_i = \lambda_i x_i^k$. This yields,

$x_i^{k+1}=x_i^k-\alpha \lambda _i x_i^k=(1-\alpha \lambda _i)x_i^k$

That was a single step. Unrolling the recursion all the way back to $x_i^0$, we get

$x_i^{k+1}=(1-\alpha \lambda _i)^{k+1}x_i^0$
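(Not part of the derivation, but a quick numerical check of my own that the coordinates really do decouple like this: run plain GD on a small quadratic, map the iterate into the eigenbasis via $x^k = Q^T(w^k-w^*)$, and compare against $(1-\alpha\lambda_i)^k x_i^0$.)

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # symmetric positive definite example (my choice)
b = np.array([1.0, -1.0])
w_star = np.linalg.solve(A, b)
lambdas, Q = np.linalg.eigh(A)

alpha, K = 0.1, 20
w = np.array([2.0, -3.0])         # arbitrary starting point w^0
x0 = Q.T @ (w - w_star)           # x^0 in the eigenbasis

for _ in range(K):
    w = w - alpha * (A @ w - b)   # plain GD step

x_K = Q.T @ (w - w_star)          # coordinates after K steps
print(np.allclose(x_K, (1 - alpha * lambdas)**K * x0))   # True
```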

All this seems really useless at this point. Let's go back to our initial concern, the $w$'s. From our original change of basis, we know that $w^k-w^*=Qx^k$. Another way of writing the multiplication of the matrix $Q$ by the vector $x^k$ is as $\sum_i x_i^kq_i$. But we've shown above that $x_i^{k}=(1-\alpha \lambda _i)^{k}x_i^0$. Putting everything together, we have obtained the desired "closed form" formula for the GD iterates:

$w^k-w^*=\sum_i x_i^0(1-\alpha \lambda _i)^{k} q_i$

This is essentially an expression for the "error" at iteration $k$ of GD (how far we are from the optimal solution, $w^*$). Since we're interested in evaluating the performance of GD, this is the expression we want to analyze. There are two immediate observations. The first is that this term goes to 0 as $k$ goes to infinity, provided $0 < \alpha\lambda_i < 2$ for every $i$ so that $|1-\alpha\lambda_i| < 1$, which is of course good news. The second is that the error decomposes very nicely into the separate elements of $x^0$, which is even nicer for the sake of our analysis. Here I quote from the original post, since I think they explain it nicely:

Each element of $x^0$ is the component of the error in the initial guess in the $Q$-basis. There are $n$ such errors, and each of these errors follows its own, solitary path to the minimum, decreasing exponentially with a compounding rate of $1-\alpha \lambda_i $. The closer that number is to 1, the slower it converges.
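(One more numerical check of my own, this time back in the original space: reconstructing $w^k - w^*$ from the closed form $\sum_i x_i^0(1-\alpha\lambda_i)^k q_i$ matches the actual GD iterate.)

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # same example matrix as above (my choice)
b = np.array([1.0, -1.0])
w_star = np.linalg.solve(A, b)
lambdas, Q = np.linalg.eigh(A)    # columns of Q are the q_i

alpha, K = 0.1, 15
w0 = np.array([2.0, -3.0])
x0 = Q.T @ (w0 - w_star)

w = w0.copy()
for _ in range(K):
    w = w - alpha * (A @ w - b)   # actual GD run

# Closed form: Q @ ((1 - alpha*lambda)^K * x0) == sum_i x_i^0 (1 - alpha*lambda_i)^K q_i
error_closed_form = Q @ ((1 - alpha * lambdas)**K * x0)
print(np.allclose(w - w_star, error_closed_form))   # True
```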

I hope this clears things up for you enough that you can go on to continue reading the post. It's a really good one!
