[Math] Stochastic Gradient Descent (SGD) algorithm explanation

gradient descent, machine learning, optimization, statistics

I am new to machine learning and am currently trying to understand the Stochastic Gradient Descent (SGD) algorithm.
$$
w := w - \eta\nabla Q_i(w)
$$
As I understand it so far, $Q_i(w)$ is what is going to be estimated, with $i$ being the current dataset under observation.

Can anyone help break down what the equation means?
Thank you

Best Answer

I am not sure how much you know, but basically SGD is trying to find the best $w$ by minimizing $Q$. Generally, $Q$ is an error function; thus, by following the gradient direction in the space of values of $w$, we are moving towards the $w$ that minimizes the error. More concretely, if you consider linear regression or perceptron classification, $w$ specifies the weight parameters of the model and $Q(w)$ is a measure of the error that the model makes on the data.
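
To make this concrete (this example is not from your question, just a standard illustration): for least-squares linear regression on examples $(x_i, y_i)$, each term of the objective and its gradient are $$ Q_i(w) = \tfrac{1}{2}\left(w^\top x_i - y_i\right)^2 \;\;\;\;\implies\;\;\;\; \nabla Q_i(w) = \left(w^\top x_i - y_i\right)x_i $$ so the index $i$ refers to a single training example, not the whole dataset.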

To make sure you understand, ordinary (full-batch) gradient descent is written: $$ w \leftarrow w - \eta \nabla Q(w) $$ where the error objective is written (with its gradient): $$ Q(w) = \frac{1}{n}\sum_i Q_i(w)\;\;\;\;\implies\;\;\;\; \nabla Q(w) = \frac{1}{n}\sum_i \nabla Q_i(w) $$ Let's break it down: rewriting it as an iteration via $w_{t+1}=w_t - \eta \nabla Q(w_t)$ makes it a bit easier to see. Basically, $w$ is the machine learning model (i.e. its parameters, which determine its behaviour). So, given some $w_t$, we add some small vector to it (or take a small step, if you prefer) to move in the direction of the optimal $w_*$ (which specifies the model with the least $Q$). The size of the step is determined by the parameter $\eta\in\mathbb{R}$, and the direction is determined by the gradient of $Q$. (Note that when the gradient is zero, all the partial derivatives of $Q$ with respect to $w$ have vanished, meaning we have reached a stationary point, ideally a minimum. If you are confused as to why the negative gradient is the direction of steepest descent, see here.)
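
If a code sketch helps, here is plain (full-batch) gradient descent for the least-squares example above; the data, step size, and iteration count are all made up purely for illustration:

```python
import numpy as np

# Made-up toy data for illustration: n examples with d features.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

def grad_Q(w, X, y):
    """Gradient of Q(w) = (1/n) sum_i 0.5*(w.x_i - y_i)^2,
    i.e. the average of all n per-example gradients."""
    residuals = X @ w - y            # shape (n,)
    return X.T @ residuals / len(y)

eta = 0.1        # step size (learning rate)
w = np.zeros(d)  # initial guess w_0
for t in range(500):
    w = w - eta * grad_Q(w, X, y)    # w_{t+1} = w_t - eta * grad Q(w_t)

print(w)  # ends up close to true_w
```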

Now for the stochastic part: do you see a possible problem with the regular method? Imagine $n$ is enormous, and/or each $\nabla Q_i(w)$ is expensive to compute (e.g. in deep neural networks). Then evaluating the full gradient $\nabla Q(w)$, which requires all $n$ per-example gradients at every step, may be very costly.

The common solution is to estimate $\nabla Q$ instead of calculating it exactly. The simplest method is to pick a random index $j$ and use $\nabla Q \approx \nabla Q_j$, or (often better) to average over a small random minibatch of examples. Note that $\mathbb{E}_j[\nabla Q_j]=\nabla Q$, so the estimate is unbiased, even though any single $\nabla Q_j$ is noisy.
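
Here is the stochastic version as a sketch, under the same made-up setup as before: each update uses only a small random minibatch, and setting `batch_size = 1` recovers the single-example update $w := w - \eta \nabla Q_j(w)$ from your question:

```python
import numpy as np

# Same made-up setup as the full-batch sketch; only the update rule changes.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

def grad_Q_batch(w, Xb, yb):
    """Average gradient over one minibatch: an unbiased estimate of grad Q(w)."""
    residuals = Xb @ w - yb
    return Xb.T @ residuals / len(yb)

eta = 0.05
batch_size = 10  # batch_size = 1 gives "pure" SGD: grad Q estimated by a single grad Q_j
w = np.zeros(d)
for epoch in range(200):
    order = rng.permutation(n)                 # reshuffle the data each pass
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]  # indices of one random minibatch
        w = w - eta * grad_Q_batch(w, X[idx], y[idx])

print(w)  # again close to true_w, but each step only touched batch_size examples
```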

This has another benefit, beyond computational efficiency, in that it helps avoid local minima in the error function, which "real" gradient descent could get stuck in.

See also here.