[Math] Stochastic Gradient Descent (SGD) algorithm explanation

gradient descent, machine learning, optimization, statistics

I am new to machine learning and am currently trying to understand the Stochastic Gradient Descent (SGD) algorithm.
$$
w := w - \eta\nabla Q_i(w)
$$
As I understand it so far, $Q_i(w)$ is what is going to be estimated, with $i$ being the current dataset under observation.

Can anyone help break down what the equation means?
Thank you

Best Answer

I am not sure how much you know, but basically SGD is trying to find the best $w$ by minimizing $Q$. Generally, $Q$ is an error function; thus, by following the gradient direction in the space of values of $w$, we are moving towards the $w$ that minimizes the error. More concretely, if you consider linear regression or perceptron classification, $w$ specifies the weight parameters of the model and $Q(w)$ is a measure of the error that the model makes on the data.
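
To make this concrete (this example is not from your question, just a standard illustration): for least-squares linear regression on examples $(x_i, y_i)$, each term of the objective and its gradient are $$ Q_i(w) = \tfrac{1}{2}\left(w^\top x_i - y_i\right)^2 \;\;\;\;\implies\;\;\;\; \nabla Q_i(w) = \left(w^\top x_i - y_i\right)x_i $$ so the index $i$ refers to a single training example, not the whole dataset.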

To make sure you understand, ordinary (full-batch) gradient descent is written: $$ w \leftarrow w - \eta \nabla Q(w) $$ where the error objective is written (with its gradient): $$ Q(w) = \frac{1}{n}\sum_i Q_i(w)\;\;\;\;\implies\;\;\;\; \nabla Q(w) = \frac{1}{n}\sum_i \nabla Q_i(w) $$ Let's break it down: rewriting it as an iteration via $w_{t+1}=w_t - \eta \nabla Q(w_t)$ makes it a bit easier to see. Basically, $w$ is the machine learning model (i.e. its parameters, which determine its behaviour). So, given some $w_t$, we add some small vector to it (or take a small step, if you prefer) to move in the direction of the optimal $w_*$ (which specifies the model with the least $Q$). The size of the step is determined by the parameter $\eta\in\mathbb{R}$, and the direction is determined by the gradient of $Q$. (Note that when the gradient is zero, all the partial derivatives of $Q$ with respect to $w$ have vanished, meaning we have reached a stationary point, ideally a minimum. If you are confused as to why the negative gradient is the direction of steepest descent, see here.)
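
If a code sketch helps, here is plain (full-batch) gradient descent for the least-squares example above; the data, step size, and iteration count are all made up purely for illustration:

```python
import numpy as np

# Made-up toy data for illustration: n examples with d features.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

def grad_Q(w, X, y):
    """Gradient of Q(w) = (1/n) sum_i 0.5*(w.x_i - y_i)^2,
    i.e. the average of all n per-example gradients."""
    residuals = X @ w - y            # shape (n,)
    return X.T @ residuals / len(y)

eta = 0.1        # step size (learning rate)
w = np.zeros(d)  # initial guess w_0
for t in range(500):
    w = w - eta * grad_Q(w, X, y)    # w_{t+1} = w_t - eta * grad Q(w_t)

print(w)  # ends up close to true_w
```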

Now for the stochastic part: do you see a possible problem with the regular method? Imagine $n$ is enormous, and/or each $\nabla Q_i(w)$ is expensive to compute (e.g. in deep neural networks). Then evaluating the full gradient $\nabla Q(w)$, which requires all $n$ per-example gradients at every step, may be very costly.

The common solution is to estimate $\nabla Q$ instead of calculating it exactly. The simplest method is to pick a random index $j$ and use $\nabla Q \approx \nabla Q_j$, or (often better) to average over a small random minibatch of examples. Note that $\mathbb{E}_j[\nabla Q_j]=\nabla Q$, so the estimate is unbiased, even though any single $\nabla Q_j$ is noisy.
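
Here is the stochastic version as a sketch, under the same made-up setup as before: each update uses only a small random minibatch, and setting `batch_size = 1` recovers the single-example update $w := w - \eta \nabla Q_j(w)$ from your question:

```python
import numpy as np

# Same made-up setup as the full-batch sketch; only the update rule changes.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

def grad_Q_batch(w, Xb, yb):
    """Average gradient over one minibatch: an unbiased estimate of grad Q(w)."""
    residuals = Xb @ w - yb
    return Xb.T @ residuals / len(yb)

eta = 0.05
batch_size = 10  # batch_size = 1 gives "pure" SGD: grad Q estimated by a single grad Q_j
w = np.zeros(d)
for epoch in range(200):
    order = rng.permutation(n)                 # reshuffle the data each pass
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]  # indices of one random minibatch
        w = w - eta * grad_Q_batch(w, X[idx], y[idx])

print(w)  # again close to true_w, but each step only touched batch_size examples
```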

This has another benefit, beyond computational efficiency, in that it helps avoid local minima in the error function, which "real" gradient descent could get stuck in.

See also here.