There are two reasons momentum appears in training ANNs.
1. It lets the algorithm "power through" flat areas, local minima, and saddle points. Empirically speaking, high-dimensional loss surfaces are not pretty and have lots of these; hence, empirically, momentum has been found to be useful.
2. It speeds up training (again, empirically), probably partly due to (1), but also because it effectively adapts the step size (faster in homogeneous areas, slower in rougher ones) and helps avoid oscillations.
Personally, I also speculate that it helps average out the noise inherent in SGD (i.e. you "keep around" some of the information from previous training batches).
As for the second part of your question (i.e. "what are the mass and velocity?"), normally SGD is written:
$$
\theta_t = \theta_{t-1} - \eta\hat{\nabla}\mathcal{L}(\theta_{t-1})
$$
as an update rule. Note that I'm using $\hat{\nabla}$ because we don't have the gradient; we just have an estimator of it. We can think of this as integrating a physical system, where $\theta$ is position and $-\hat{\nabla}\mathcal{L}$ plays the role of velocity; I suppose $\eta$ is then essentially $\Delta t$.
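In code, a single step of this update might look as follows (a minimal sketch of my own; `grad_est` stands in for the estimator $\hat{\nabla}\mathcal{L}$, and the toy loss is my choice for illustration):

```python
import numpy as np

def sgd_step(theta, grad_est, eta=0.1):
    """theta_t = theta_{t-1} - eta * grad_est(theta_{t-1})."""
    return theta - eta * grad_est(theta)

# Toy loss L(theta) = ||theta||^2 / 2, whose (exact) gradient is theta itself.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, lambda th: th)
# theta has contracted geometrically toward the minimizer at the origin
```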
Let's define a "weighted" velocity update: $$ v_{t+1} = \alpha v_t + (1-\alpha)\hat{\nabla}\mathcal{L}(\theta_{t}) $$
$$ \theta_{t+1} = \theta_t - \beta v_{t+1} $$
so that the instantaneous velocity at time $t+1$ is controlled by the "mass" $\alpha$. When $\alpha$ is high (close to 1), we mostly keep the velocity from last time, i.e. we are almost entirely driven by momentum. When $\alpha$ is low (close to 0), there is no momentum and the "particle" simply follows wherever the current "force" is pushing. (Note that I would not take the physical analogy too far though...)
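Here is a minimal sketch of that weighted update (my own code, not any library's API) on a toy quadratic loss, with illustrative choices $\alpha=0.9$ and $\beta=0.1$:

```python
import numpy as np

def momentum_step(theta, v, grad_est, alpha=0.9, beta=0.1):
    """v_{t+1} = alpha*v_t + (1-alpha)*grad ;  theta_{t+1} = theta_t - beta*v_{t+1}."""
    v = alpha * v + (1 - alpha) * grad_est(theta)
    theta = theta - beta * v
    return theta, v

# Quadratic bowl L(theta) = ||theta||^2 / 2, gradient = theta.
theta, v = np.array([3.0, -1.0]), np.zeros(2)
for _ in range(300):
    theta, v = momentum_step(theta, v, lambda th: th)
# theta spirals in toward the origin; the running velocity smooths the trajectory
```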
Check out this article on momentum.
Disregarding my own advice above, here's my take on the "physics".
Consider a particle with time-varying position $x_t$, velocity $ v_t $, and mass $m$, undergoing an external force $F(x_t,t)$.
We have (by $F=ma$):
$$ \frac{d^2x_t}{dt^2} = \frac{1}{m}F(x_t,t) $$
which we can rewrite as:
$$
\frac{dv_t}{dt} = \frac{1}{m}F(x_t,t)\;\;\;\;\;\&\;\;\;\;\;
\frac{dx_t}{dt} = v_t
$$
Discretizing the two equations to first-order (well, symplectic Euler, I guess) gives:
\begin{align}
v_{t+1} &= v_t + \frac{\Delta t}{m}F(x_t,t) \\
x_{t+1} &= x_t + v_{t+1} \Delta t
\end{align}
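As a sketch, here is that integrator applied to a concrete force (a spring, $F(x)=-x$, chosen by me purely for illustration); symplectic Euler is known to approximately conserve the oscillator's energy, which makes for an easy check:

```python
def symplectic_euler(x, v, force, m=1.0, dt=0.01, steps=1000):
    """Update v first, then move x using the *new* velocity (symplectic Euler)."""
    for _ in range(steps):
        v = v + (dt / m) * force(x)
        x = x + v * dt
    return x, v

# Harmonic oscillator: F(x) = -x, starting at rest with x = 1.
x, v = symplectic_euler(1.0, 0.0, lambda x: -x)
energy = 0.5 * v**2 + 0.5 * x**2  # stays close to the initial energy 0.5
```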
Now, just let $x_t=\theta_t$, $\eta=\Delta t$, and set the external force to be $F(x_t,t) = F(\theta)=\gamma\hat{\nabla}\mathcal{L}(\theta)$.
Notice that as $m\rightarrow 0$, the contribution of $F$ becomes higher (i.e. momentum doesn't matter), whereas if $m\rightarrow \infty$, $v_t$ becomes a constant for all $t$ (nothing can move a huge neutron star from whatever its current trajectory is, right?).
Also, let $p_t = mv_t$, $ \xi = \eta/m $, and $\kappa = \eta\gamma$. Then:
\begin{align}
p_{t+1} &= p_t + \kappa\hat{\nabla}\mathcal{L}(\theta_t) \\
\theta_{t+1} &= \theta_t + \xi p_{t+1}
\end{align}
This is pretty close to the momentum equations above, the main differences being the use of the weighted average (and the sign convention). How can we get something more like that?
Well, let's try to add a friction force, or drag, i.e.:
\begin{align}
\frac{dx_t}{dt} &= v_t \\
\frac{dv_t}{dt} &= \frac{1}{m}\left( F(x_t) - \epsilon v_t \right)
\end{align}
Discretizing this one gives:
$$
v_{t+1} = \left(1-\frac{\Delta t\epsilon}{m}\right)v_t + \frac{\Delta t}{m}F(x_t)
\;\;\;\;\;\&\;\;\;\;\;
x_{t+1} = x_t + \Delta t v_{t+1}
$$
which we can translate into
$$
p_{t+1} = \left(1-\frac{\eta\epsilon}{m}\right)p_t + \eta\gamma\hat{\nabla}\mathcal{L}(\theta_t)
\;\;\;\;\;\&\;\;\;\;\;
\theta_{t+1} = \theta_t + \frac{\eta}{m} p_{t+1}
$$
which we can rewrite as
$$ p_{t+1} = \alpha p_t + (1-\alpha)\hat{\nabla}\mathcal{L}(\theta_{t}) $$
$$ \theta_{t+1} = \theta_t + \beta p_{t+1} $$
where we have set:
$$
\alpha = 1 - \frac{\eta\epsilon}{m},\;\;\;\;\;
\beta = \frac{\eta}{m},\;\;\;\;\;
\gamma = \frac{\epsilon}{m}
$$
(so that $\eta\gamma = \frac{\eta\epsilon}{m} = 1-\alpha$). Notice that, up to the sign of the position update, this last form is identical to the form of momentum at which we first looked! (The sign is just our choice of force direction: taking $F=-\gamma\hat{\nabla}\mathcal{L}$ instead, i.e. a force that pushes downhill, flips $p$ and recovers $\theta_{t+1} = \theta_t - \beta p_{t+1}$ exactly.)
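This identification can be checked numerically. A quick sketch (the constants are arbitrary choices of mine; I use the same sign convention as in the derivation just above, with the position updated by $+\frac{\eta}{m}p_{t+1}$ in both forms):

```python
import numpy as np

eta, eps, m = 0.1, 2.0, 0.5
alpha = 1 - eta * eps / m   # decay factor on p
gamma = eps / m             # so that eta*gamma = 1 - alpha
beta = eta / m

rng = np.random.default_rng(0)
grads = rng.normal(size=50)  # stand-ins for the gradient estimates

# Physics form: p' = (1 - eta*eps/m) p + eta*gamma*g,  theta' = theta + (eta/m) p'
p, theta_phys = 0.0, 1.0
for g in grads:
    p = (1 - eta * eps / m) * p + eta * gamma * g
    theta_phys += (eta / m) * p

# Weighted-average form: q' = alpha*q + (1-alpha)*g,  theta' = theta + beta*q'
q, theta_avg = 0.0, 1.0
for g in grads:
    q = alpha * q + (1 - alpha) * g
    theta_avg += beta * q

# the two trajectories coincide (up to floating point)
```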
I think you might be mixing up a few things here. In general, gradient descent is probably the simplest optimization algorithm there is, and it requires very few assumptions: essentially only that the function is continuously differentiable. However, if you seek to ensure convergence to a global minimum, or you wish to derive the complexity of the algorithm, things get a bit more complicated.
If you make no further assumptions on the function, then gradient descent can only be guaranteed to advance towards a stationary point (a local min/max or a saddle point), if such a point exists. Additional assumptions yield additional guarantees:
- If we assume the function has a Lipschitz gradient, then we can derive an iteration complexity (a bound on how much the algorithm "progresses" from one iteration to the next).
- If we assume the function is convex, then any stationary point it reaches must be a global minimum (for convex functions, every local minimum is global); however, such a minimum need not exist (consider $f(x)=e^x$).
- If we assume the function has bounded level sets, then we are guaranteed it will converge to a stationary point. Bounded level sets can be obtained in various ways, most notably if the domain is compact or if the function is coercive (strong convexity, for instance, implies coercivity).
- Strictly convex functions admit a unique global minimum (while convex functions might have a range of minimizers).
- Strongly convex functions give an improved iteration complexity.
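A toy illustration of the convex and bounded-level-set bullets (my own sketch): gradient descent on $f(x)=x^2$, which has bounded level sets, converges to the minimizer, while on the convex $f(x)=e^x$, which has no minimizer, the iterates just drift off towards $-\infty$:

```python
import math

def gd(x, grad, eta=0.1, steps=500):
    """Plain gradient descent with a fixed step size."""
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

x_quad = gd(5.0, lambda x: 2 * x)        # f(x) = x^2: converges to 0
x_exp = gd(0.0, lambda x: math.exp(x))   # f(x) = e^x: no minimizer, x keeps decreasing
```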
Note that neural networks optimize cost functions that do not satisfy these assumptions; with ReLU activations, for example, they are not even continuously differentiable. So neural networks come with few theoretical guarantees, although in practice training tends to converge to a local minimum or a saddle point, at an unknown rate. Moreover, in practice one uses a variant of gradient descent called stochastic gradient descent (SGD), which processes the data in mini-batches at every iteration, since calculating and storing gradients over the full dataset would be too expensive in terms of the memory and computation required.
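A minimal mini-batch SGD loop for a least-squares problem (a sketch under assumptions of my own: the function name, shapes, and hyperparameters are illustrative, not any framework's API):

```python
import numpy as np

def sgd_least_squares(X, y, eta=0.1, batch_size=8, epochs=200, seed=0):
    """Minimize (1/2n)||Xw - y||^2 using mini-batch gradient estimates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)  # fresh shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ w - yb) / len(batch)  # batch gradient estimate
            w -= eta * grad
    return w

# Noiseless data generated from a known weight vector: SGD should recover it.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
w_true = np.array([1.0, -2.0, 0.5])
w_hat = sgd_least_squares(X, X @ w_true)
```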
I think that Lipschitzness of $\ell$ (together with differentiability, so that the pointwise derivative exists) is a sufficient condition for that identity to hold. Indeed, for any $i$:
$$\begin{align}\mathbb E\left[\frac{\partial \ell}{\partial w_i}\right] &= \mathbb E\left[\lim_{h\to0}\frac{\ell(w_i+h)-\ell(w_i)}{h}\right]\\ &=\mathbb E\left[\lim_{n\to\infty}\frac{\ell(w_i+1/n)-\ell(w_i)}{1/n}\right] \end{align}$$
Now, if $\ell$ is $L$-Lipschitz, then the sequence of difference quotients $\frac{\ell(w_i+1/n)-\ell(w_i)}{1/n}$ is dominated by the constant $L$, and thus the Dominated Convergence Theorem applies. Hence
$$\begin{align}\mathbb E\left[\frac{\partial \ell}{\partial w_i}\right]&=\mathbb E\left[\lim_{n\to\infty}\frac{\ell(w_i+1/n)-\ell(w_i)}{1/n}\right]\\ &=\lim_{n\to\infty} \mathbb E\left[\frac{\ell(w_i+1/n)-\ell(w_i)}{1/n}\right]\\ &=\lim_{n\to\infty}\frac{\mathbb E[\ell(w_i+1/n)] - \mathbb E[\ell(w_i)]}{1/n}\\ &=\frac{\partial}{\partial w_i}\mathbb E[\ell]\end{align}$$
(the same argument along any sequence $h_n\to0$ gives the full two-sided derivative).
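A quick numerical sanity check of the exchange $\mathbb E[\partial\ell/\partial w_i] = \partial\,\mathbb E[\ell]/\partial w_i$, using a toy quadratic loss of my choosing over a finite sample (a quadratic is not globally Lipschitz, but over a finite sample the exchange is trivially valid, so it still serves as a sanity check):

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])  # a finite "data distribution" with uniform weights
w = 0.7                         # the point at which we compare the two sides

# E[dl/dw]: average of the pointwise derivatives of l(w; x) = (w - x)^2
mean_of_grads = np.mean(2 * (w - xs))

# d/dw E[l]: central finite difference of the averaged loss
def expected_loss(w):
    return np.mean((w - xs) ** 2)

h = 1e-6
grad_of_mean = (expected_loss(w + h) - expected_loss(w - h)) / (2 * h)
# both sides equal 2 * (w - mean(xs)) = -0.6
```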