Solved – Gradient descent optimization


I am trying to understand gradient descent optimization in machine learning (ML) algorithms. I understand that there is a cost function, and the aim is to minimize the error $\hat y - y$. In a scenario where the weights $w_1$ and $w_2$ are being optimized to give the minimum error, and partial derivatives are used, does each step change both $w_1$ and $w_2$, or is it a combination (e.g., for a few iterations only $w_1$ is changed, and once $w_1$ is no longer reducing the error, the derivative moves on to $w_2$)? The application could be a linear regression model, a logistic regression model, or a boosting algorithm.

Best Answer

Gradient descent updates all parameters at each step. You can see this in the update rule:

$$ w^{(t+1)}=w^{(t)} - \eta\nabla f\left(w^{(t)}\right). $$

Since the gradient of the loss function $\nabla f(w)$ is vector-valued with dimension matching that of $w$, all parameters are updated at each iteration.
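
To make this concrete, here is a minimal sketch (using a toy two-feature least-squares problem that I made up for illustration) in which the gradient is a 2-vector and both $w_1$ and $w_2$ move at every iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # two features -> two weights w1, w2
true_w = np.array([2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(2)                        # w = (w1, w2)
eta = 0.1                              # learning rate

for t in range(200):
    y_hat = X @ w
    grad = 2 * X.T @ (y_hat - y) / len(y)  # gradient of the mean squared error
    w = w - eta * grad                     # w1 AND w2 are updated together

print(w)   # close to [2, -3]
```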

The learning rate $\eta$ is a positive number that rescales the gradient. Taking too large a step can bounce you endlessly across the loss surface with no improvement in your loss function; too small a step can mean tediously slow progress toward the optimum.
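
You can see both failure modes on the simple quadratic $f(w) = w^2$ (chosen purely for demonstration, with gradient $2w$):

```python
def run(eta, steps=20, w=1.0):
    for _ in range(steps):
        w = w - eta * 2 * w        # one gradient descent step on f(w) = w**2
    return w

print(run(eta=0.1))    # shrinks toward the minimum at 0
print(run(eta=1.1))    # overshoots every step and the iterates grow without bound
```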

Although you could estimate linear regression parameters using gradient descent, it's not a good idea: ordinary least squares has a closed-form solution, and standard solvers compute it directly, faster and more reliably than iterating toward it.
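
For example (same toy data as above, just as a sketch), NumPy's least-squares routine gives the exact OLS coefficients in one call:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # exact OLS solution
print(w_ols)                                     # close to [2, -3]
```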

Likewise, there are better ways to estimate logistic regression coefficients, such as Newton–Raphson (iteratively reweighted least squares), which is what standard statistical software typically uses.
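
A short sketch (assuming statsmodels is available, on simulated data invented here for illustration) of fitting a logistic regression with a Newton-type solver rather than plain gradient descent:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
p = 1 / (1 + np.exp(-(X @ np.array([1.5, -1.0]))))   # true success probabilities
y = rng.binomial(1, p)

result = sm.Logit(y, X).fit()    # Newton-Raphson by default
print(result.params)             # estimated coefficients
```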