[Math] Expectation of gradient in stochastic gradient descent algorithm

gradient-descent, optimization, probability, statistics

I'm studying the stochastic gradient descent algorithm for optimization. It looks like this:

$L(w) = \frac{1}{N} \sum_{n=1}^{N} L_n(w)$

$w^{(t+1)} = w^{(t)} - \gamma \nabla L_n(w^{(t)})$

I assume that $n$ is chosen randomly each time the algorithm iterates (is that right?).

The problem is that my notes state that $\mathbb{E}[\nabla L_n(w)] = \nabla L(w)$. Where does this come from?

Best Answer

Let's assume we are talking about stochastic gradient descent where we update the weights based on a single example (not a minibatch), drawn from a total data set of size $N$.

The total error over the whole set is given by: $$ L(w) = \frac{1}{N}\sum\limits_{n=1}^{N} L_n(w) $$ Then, at every step, a random sample point $n\sim U$ is chosen, and we update the weights via: $$ w \leftarrow w - \gamma\nabla L_n(w) $$ where $U$ means uniform over the data set. Now we want to know whether $\mathbb{E}_{n\sim U}[\nabla L_n(w)]=\nabla L(w)$.
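As a concrete aside, here is a minimal Python sketch of this sampling-and-update loop on a made-up least-squares problem, with per-example loss $L_n(w) = \tfrac{1}{2}(x_n^\top w - y_n)^2$. The data, loss, and step size here are illustrative assumptions, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N examples with d features (made up purely for illustration)
N, d = 100, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

def grad_Ln(w, n):
    """Gradient of the per-example squared loss L_n(w) = 0.5 * (x_n . w - y_n)^2."""
    return (X[n] @ w - y[n]) * X[n]

w = np.zeros(d)
gamma = 0.01  # step size

for t in range(10_000):
    n = rng.integers(N)            # n ~ Uniform over the data set
    w = w - gamma * grad_Ln(w, n)  # w <- w - gamma * grad L_n(w)
```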

We show this as follows: \begin{align} \mathbb{E}_{n\sim U}[\nabla L_n(w)] &= \nabla\; \mathbb{E}_{n\sim U}[ L_n(w)] \\ &= \nabla \sum\limits_{i=1}^N P(n=i) L_i(w)\\ &= \nabla \frac{1}{N}\sum\limits_{i=1}^{N} L_i(w)\\ &= \nabla L(w) \end{align}

The first step is probably the nastiest (although in the discrete case here it is immediate, since the expectation is just a finite sum), but in general we can interchange the gradient and expectation assuming $L$ is sufficiently smooth and bounded (which it usually is in practice).

The other steps are just the definition of discrete expectation (and they still go through, suitably adapted, for continuous spaces as well).
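Continuing the toy setup from the sketch above (same made-up data, so this is only a numerical sanity check, not part of the derivation), the identity can be verified directly: averaging the per-example gradients over all $n$, which is exactly the expectation under the uniform distribution, recovers the full-batch gradient.

```python
# E_{n~U}[grad L_n(w)] is just the average of the per-example gradients,
# which by definition equals grad L(w) for the mean loss.
w_test = rng.normal(size=d)
expected_grad = np.mean([grad_Ln(w_test, n) for n in range(N)], axis=0)
full_grad = X.T @ (X @ w_test - y) / N  # gradient of L(w) = (1/2N) ||Xw - y||^2
assert np.allclose(expected_grad, full_grad)
```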
