I am trying to understand gradient descent optimization in machine learning (ML) algorithms. I understand that there is a cost function, and the aim is to minimize the error $\hat y - y$. In a scenario where two weights $w_1$ and $w_2$ are being optimized to give the minimum error, and partial derivatives are being used, does gradient descent change both $w_1$ and $w_2$ in each step, or is it done one at a time (e.g., for a few iterations only $w_1$ is changed, and once $w_1$ is no longer reducing the error, the derivative moves on to $w_2$)? The application could be a linear regression model, a logistic regression model, or a boosting algorithm.
Solved – Gradient descent optimization
Related Solutions
This is more of a workaround than a direct solution, but a common way to avoid local minima is to run your algorithm several times, from different starting locations. You can then take the best outcome or the average as your final result.
The reason why you might want to take the average rather than the best is to avoid overfitting. Many model types where local minima are a problem have lots of parameters: decision trees, neural networks, etc. Simply taking the best outcome risks obtaining a model that won't generalise well to future data. Taking the average guards against this.
You can get into arbitrarily complex ways of doing the averaging.
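As a rough illustration of the restart idea (a minimal sketch of my own, not from the answer above; the toy objective, learning rate, and number of restarts are all arbitrary assumptions), here is gradient descent run from several random starting points, keeping both the single best result and the average of the fitted parameters:

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.01, n_steps=200):
    """Plain gradient descent from a given starting point."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - lr * grad(w)
    return w

# A toy non-convex objective with several local minima (purely illustrative).
def loss(w):
    return np.sum(w**4 - 3 * w**2 + w)

def grad(w):
    return 4 * w**3 - 6 * w + 1

rng = np.random.default_rng(0)
starts = rng.uniform(-2, 2, size=(10, 2))   # 10 random starting points in R^2
solutions = np.array([gradient_descent(grad, w0) for w0 in starts])

best = solutions[np.argmin([loss(w) for w in solutions])]  # best single run
averaged = solutions.mean(axis=0)                          # parameter average

print("best run:", best, "loss:", loss(best))
print("average :", averaged, "loss:", loss(averaged))
```

Whether averaging parameters (rather than averaging predictions, as bagging-style ensembles do) makes sense depends on the model, so treat the averaging line as a placeholder for whatever combination rule you prefer.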
Assuming the error function is twice continuously differentiable and coercive (the error goes to infinity as the parameters $\theta$ go to infinity, e.g. if you use L1/L2 norm regularisation on the parameters), then I believe the answer is yes, and I will sketch a proof, which hopefully identifies the key concepts involved.
I will drop $X, y$ to simplify notation.
Coercivity basically allows you to restrict attention to a bounded region of parameter space (a compact set, in mathematical terms): compute the cost at $\theta = 0$; then you are only interested in those $\theta$ whose regularisation term does not exceed that value. For example, if $J(\theta) = \mathrm{error}(\theta) + \alpha\|\theta\|^2$, then we are only interested in $\|\theta\|^2 \le \mathrm{error}(0)/\alpha =: K$, since outside this region the regularisation term alone already leads to a higher $J(\theta)$.
Applying Taylor's theorem with remainder to the change $J(\theta^{(t)}) - J(\theta^{(t-1)})$ with step size $\gamma$:
$$J(\theta^{(t)}) - J(\theta^{(t-1)}) = -\gamma\, \nabla J(\theta^{(t-1)})\cdot \nabla J(\theta^{(t-1)}) + \frac{\gamma^2}{2}\, \nabla J(\theta^{(t-1)})^T H\!\left((1-\eta)\theta^{(t-1)}+\eta\, \theta^{(t)}\right) \nabla J(\theta^{(t-1)})$$
Here $H$ is the matrix of second derivatives of $J$ with respect to $\theta$ (the Hessian), and $\eta$ is an unknown value strictly between 0 and 1 given to us by Taylor's theorem.
So to ensure this change is negative (i.e. the cost decreases), we require $\gamma < \frac{2}{\max_{\|\theta\|^2\le K}\sigma(H)}$,
where $\sigma(H)$ is the maximum eigenvalue of $H(\theta)$.
Because $\|\theta\|^2 \le K$ is a compact set and $\sigma(H)$ is a continuous function of $\theta$, a (finite) maximum exists over this region (by the extreme value theorem), and so we can find a correspondingly small enough step size.
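To make the bound concrete, here is a small numerical check (my own sketch, not part of the argument above; the ridge-regularised least-squares cost, the random data, and the value of $\alpha$ are all assumptions). For this quadratic cost the Hessian is constant, so $\max \sigma(H)$ can be computed directly, and gradient descent behaves very differently for step sizes just below and just above $2/\sigma_{\max}$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
alpha = 0.5

def J(theta):
    """Ridge-regularised least-squares cost."""
    return 0.5 * np.sum((X @ theta - y) ** 2) + 0.5 * alpha * np.sum(theta ** 2)

def grad(theta):
    return X.T @ (X @ theta - y) + alpha * theta

# For this quadratic cost the Hessian is constant: H = X'X + alpha*I.
H = X.T @ X + alpha * np.eye(3)
sigma_max = np.linalg.eigvalsh(H).max()
bound = 2.0 / sigma_max

def run(gamma, n_steps=500):
    theta = np.zeros(3)
    for _ in range(n_steps):
        theta = theta - gamma * grad(theta)
    return J(theta)

print("step-size bound 2/sigma_max:", bound)
print("J with gamma = 0.9 * bound :", run(0.9 * bound))   # settles near the minimum
print("J with gamma = 1.1 * bound :", run(1.1 * bound))   # blows up
```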
Best Answer
Gradient descent updates all parameters at each step. You can see this in the update rule:
$$ w^{(t+1)}=w^{(t)} - \eta\nabla f\left(w^{(t)}\right). $$
Since the gradient of the loss function $\nabla f(w)$ is vector-valued with dimension matching that of $w$, all parameters are updated at each iteration.
The learning rate $\eta$ is a positive number that re-scales the gradient. Taking too large a step can endlessly bounce you across the loss surface with no improvement in your loss function; too small a step can mean tediously slow progress towards the optimum.
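To make the simultaneous update explicit (a minimal sketch of my own, not from the answer; the toy data, learning rate, and iteration count are assumptions), here is gradient descent on a two-weight linear regression. The gradient is a vector with one component per weight, so $w_1$ and $w_2$ both move at every iteration:

```python
import numpy as np

# Toy data for a two-feature linear regression (w1 and w2 as in the question).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(2)   # w[0] plays the role of w1, w[1] of w2
eta = 0.05        # learning rate

for t in range(100):
    y_hat = X @ w
    grad = X.T @ (y_hat - y) / len(y)   # vector gradient: one entry per weight
    w = w - eta * grad                  # both w1 and w2 change in the same step

print(w)   # should end up close to [3, -2]
```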
Although you could estimate linear regression parameters using gradient descent, it's not a good idea: ordinary least squares has an exact closed-form solution.
Likewise, there are better ways to estimate logistic regression coefficients, such as the Newton-type iteratively reweighted least squares used by standard software.
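As a quick illustration of that point (again my own sketch on assumed toy data), the exact least-squares solution from a single linear-algebra call matches what gradient descent only reaches after many iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=200)

# Closed-form least-squares fit in one call.
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

# The same fit via gradient descent, for comparison.
w = np.zeros(2)
for _ in range(2000):
    w = w - 0.05 * X.T @ (X @ w - y) / len(y)

print(w_exact)   # exact minimiser
print(w)         # agrees to several decimal places
```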