Getting to the gradient descent algorithm

Tags: gradient-descent, numerical-optimization, optimization

I understand that gradient descent comes from the (quite natural) idea that we might want to choose our next weight vector ($w^{t+1}$) as

$$w^{t+1} = \arg \min_w \; \frac{1}{2} \|w-w^t\|^{2} + \eta \left( f(w^t) + \langle w-w^t, \nabla f(w^t) \rangle \right)$$

because the approximation $$f(w) \approx f(w^t)~+ \langle w-w^t,\nabla f(w^t) \rangle$$ gets worse as $\|w-w^t\|$ grows.

But how do I get from the first expression above to the gradient descent algorithm:
$$w^{t+1} = w^t - \eta \, \nabla f(w^t)?$$
A step-by-step explanation would be much appreciated.

Best Answer

You can motivate gradient descent either as minimizing a quadratic approximation of the objective, or as minimizing a linearization of the objective plus a penalty for moving away from the current point. Here is what you are actually minimizing:
$$x_{k+1} = \arg\min_{x\in \mathbb{R}^n}\; f(x_{k}) + \nabla f(x_k) \cdot (x-x_k) + \frac{1}{2\tau}\|x-x_k\|^2.$$
Since $f(x_k)$ is a constant, it does not affect the minimizer, so this is the same as
$$x_{k+1} = \arg\min_{x\in \mathbb{R}^n}\; \nabla f(x_k) \cdot (x-x_k) + \frac{1}{2\tau}\|x-x_k\|^2.$$
The objective is convex and differentiable, so the minimizer is obtained by setting its gradient with respect to $x$ to zero:
$$\nabla f(x_k) + \frac{1}{\tau}(x_{k+1}-x_k) = 0,$$
which gives
$$x_{k+1} = x_k - \tau \nabla f(x_k).$$
The second interpretation follows from considering the function $f_{x_k}(x) = f(x_k) + \nabla f(x_k) \cdot (x-x_k)$, which is linear in $x$. Adding the penalty gives $f_{x_k}^{\tau}(x) = f_{x_k}(x) + \frac{1}{2\tau}\|x-x_k\|^2$, which is $\tau^{-1}$-strongly convex and therefore has a unique minimizer $x_{k+1}$.
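As a quick numerical sanity check, here is a minimal sketch (the quadratic test function `f`, the point `x_k`, and the step size `tau` are my own assumptions, not part of the derivation) that minimizes the penalized linearization directly and compares the result with the explicit gradient step $x_k - \tau \nabla f(x_k)$:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical smooth convex test objective: f(x) = 0.5 * x^T A x - b^T x
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b

x_k = np.array([2.0, 2.0])   # current iterate (assumed for illustration)
tau = 0.1                    # step size (assumed for illustration)

# Surrogate from the answer: linearization of f at x_k plus a proximity penalty
def surrogate(x):
    return f(x_k) + grad_f(x_k) @ (x - x_k) + np.sum((x - x_k) ** 2) / (2.0 * tau)

# Minimize the surrogate numerically and compare with the closed-form gradient step
x_argmin = minimize(surrogate, x_k).x
x_grad_step = x_k - tau * grad_f(x_k)

print(x_argmin)    # agrees with x_grad_step up to solver tolerance
print(x_grad_step)
```

The two printed vectors coincide (up to solver tolerance), which is exactly the statement that the argmin of the penalized linear model is the gradient descent update.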
