[Math] Why do we update all the variables simultaneously in Gradient Descent?

gradient-descent, optimization

In the classic gradient descent algorithm, at each iteration step, we update all the variables simultaneously, i.e. $$\pmb{\theta}' \gets \pmb{\theta} - \alpha\,\frac{\partial F}{\partial \pmb{\theta}}$$

An alternative is to update the variables sequentially within each step, using each new value as soon as it becomes available.

For example, at each step: $$\pmb{\theta}_1' \gets \pmb{\theta}_1 - \alpha\,\frac{\partial F(\pmb{\theta}_1, \pmb{\theta}_2)}{\partial \pmb{\theta}_1}$$
$$\pmb{\theta}_2' \gets \pmb{\theta}_2 - \alpha\,\frac{\partial F(\pmb{\theta}_1', \pmb{\theta}_2)}{\partial \pmb{\theta}_2}$$
I'm fairly sure this would also converge to a local optimum, so why is this alternative update scheme usually not the preferred one?
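For concreteness, here is a minimal sketch of both schemes on a toy quadratic objective; the matrix `A`, the starting point, the step size, and the function names are illustrative assumptions of mine, not part of the question:

```python
import numpy as np

# Toy objective for illustration: F(theta) = 0.5 * theta^T A theta, minimized at the origin.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(theta):
    """Gradient of F at theta."""
    return A @ theta

def simultaneous_step(theta, alpha):
    """Classic gradient descent: every coordinate uses the same (old) theta."""
    return theta - alpha * grad(theta)

def sequential_step(theta, alpha):
    """Coordinate-by-coordinate update: each partial derivative sees the
    coordinates that have already been updated within this step."""
    theta = theta.copy()
    for i in range(theta.size):
        theta[i] -= alpha * grad(theta)[i]  # gradient recomputed with updated coordinates
    return theta

theta0 = np.array([1.0, -1.0])
theta_sim, theta_seq = theta0.copy(), theta0.copy()
for _ in range(100):
    theta_sim = simultaneous_step(theta_sim, alpha=0.1)
    theta_seq = sequential_step(theta_seq, alpha=0.1)

print(theta_sim, theta_seq)  # both approach the minimizer at the origin
```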

Edit: sometimes it makes sense not to update simultaneously. One use case is training neural networks in NLP: gradient descent is still used, but without updating from all the training examples at once, because computing the update over the entire training set takes a lot of time. Refer to pg 33 of this pdf.
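The scheme described in this edit is closer to stochastic (mini-batch) gradient descent than to coordinate-wise updates. A minimal sketch under that reading, on an assumed synthetic least-squares problem; the data, batch size of 32, and learning rate are illustrative choices of mine, not from the linked pdf:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # synthetic data, for illustration only
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

def grad_mse(w, Xb, yb):
    """Gradient of mean squared error on the (mini-)batch (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient descent: each update touches every training example.
w_batch = np.zeros(5)
for _ in range(50):
    w_batch -= 0.1 * grad_mse(w_batch, X, y)

# Stochastic (mini-batch) updates: each step is much cheaper, so many more
# steps fit in the same amount of computation.
w_sgd = np.zeros(5)
for _ in range(50):
    idx = rng.choice(len(y), size=32, replace=False)
    w_sgd -= 0.1 * grad_mse(w_sgd, X[idx], y[idx])
```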

Best Answer

As a simple example, let $f = \sin\left(\sum_{i=1}^n \alpha_i \theta_i\right)$. To compute all the partial derivatives at a point, you only have to evaluate the inner sum (and its cosine) once. If you cycle through the variables one at a time, the argument changes after every update, so you have to re-evaluate it $n$ times. Most often it pays off to take steps in all coordinates at the same time. A simple analogy is walking: you typically don't walk in the east-west direction first and then north-south; you walk in the shortest direction, i.e., you move in both coordinates simultaneously.
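A minimal sketch of this cost argument for the answer's example $f$; the coefficient values, starting point, and step size below are arbitrary choices for illustration:

```python
import numpy as np

a = np.array([0.5, -1.0, 2.0])           # the coefficients alpha_i from the answer
theta = np.array([0.1, 0.2, 0.3])
lr = 0.05

def full_gradient(theta):
    """All partial derivatives of f = sin(a . theta) share one evaluation
    of the inner product and one cos call."""
    s = a @ theta                         # evaluated once
    return a * np.cos(s)

# Simultaneous update: one inner-product / cos evaluation per step.
theta_sim = theta - lr * full_gradient(theta)

# Sequential update: the argument a . theta changes after every coordinate,
# so the inner product and cos must be re-evaluated n times per step.
theta_seq = theta.copy()
for i in range(len(theta)):
    s = a @ theta_seq                     # re-evaluated for each coordinate
    theta_seq[i] -= lr * a[i] * np.cos(s)
```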
