I am going through the following section of the book by Goodfellow et al. (2016), and I don't quite understand it.
Stochastic gradient descent has many important uses outside the
context of deep learning. It is the main way to train large linear
models on very large datasets. For a fixed model size, the cost per SGD
update does not depend on the training set size $m$. In practice, we often
use a larger model as the training set size increases, but we are not
forced to do so. The number of updates required to reach convergence
usually increases with training set size. However,
as $m$ approaches infinity, the model will eventually converge to its best
possible test error before SGD has sampled every example in the
training set. Increasing $m$ further will not extend the amount of training
time needed to reach the model’s best possible test error. From this
point of view, one can argue that the asymptotic cost of training a
model with SGD is $O(1)$ as a function of $m$.Section 5.9, p.150
- "The number of updates required to reach convergence usually increases with training set size". I can't get away around this one. In the normal gradient descent, it becomes computationally expensive to calculate the gradient at each step as the number of training examples increases. But I don't understand why the number of updates increases with the training size.
- "However, as $m$ approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing $m$ further will not extend the amount of training time needed to reach the model’s best possible test error." I don't understand this as well
Can you provide some intuition or arguments for the two cases above?
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Best Answer
In the first part they are talking about large-scale SGD convergence in practice, and in the second part about theoretical results on the convergence of SGD when the optimisation problem is convex.
"The number of updates required to reach convergence usually increases with training set size".
I found this statement confusing, but as @DeltaIV kindly pointed out in the comments, I think they are talking about practical considerations for a fixed model as the dataset size $m \rightarrow \infty$. I think there are two relevant phenomena:
Computational tradeoffs for distributed SGD
In a large-volume, high-rate data scenario, you might want to implement a distributed version of SGD (or, more likely, minibatch SGD). Unfortunately, making an efficient distributed version of SGD is difficult because you need to frequently share the parameter state $w$. In particular, you incur a large overhead cost for synchronising between machines, which incentivises you to use larger minibatches. The following figure from (Li et al., 2014) illustrates this nicely:
SGD here is minibatch SGD, and the other lines are algorithms they propose.
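To make the tradeoff concrete, here is a back-of-the-envelope sketch in Python. The cost constants are invented purely for illustration (they are not from (Li et al., 2014)); the point is only that a fixed per-update synchronisation overhead is amortised by larger minibatches:

```python
# Toy model of distributed minibatch SGD wall-clock cost. All constants are
# made-up assumptions for illustration, not measurements.
grad_cost_per_example = 1e-6  # seconds to compute one example's gradient (assumed)
sync_cost_per_update = 1e-2   # seconds to synchronise w across workers (assumed)
n_examples = 10**7            # total number of examples to process once

for batch_size in [1, 10, 100, 1_000, 10_000]:
    n_updates = n_examples // batch_size
    total = n_updates * (batch_size * grad_cost_per_example + sync_cost_per_update)
    print(f"b={batch_size:>6}: {n_updates:>8} updates, ~{total:9.1f}s wall clock")
```

With these (made-up) numbers, going from $b = 1$ to $b = 10{,}000$ takes the synchronisation overhead from dominating the runtime to being negligible.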
On the other hand, it is also known that for convex problems minibatches incur a computational cost which effectively slows convergence as the minibatch size increases. As you increase the minibatch size $m'$ towards the dataset size $m$, convergence becomes slower and slower until you are just doing ordinary batch gradient descent. The slowdown can be quite drastic for large minibatch sizes. Again, this is illustrated nicely by (Li et al., 2014):
Here they are plotting the minimum objective value they found on the KDD04 dataset after using $10^7$ datapoints.
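You can reproduce the qualitative effect on a toy convex problem. The sketch below (my own construction, not from the paper) runs minibatch SGD on a least-squares objective under a fixed budget of individual gradient evaluations, so larger batches get proportionally fewer updates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def objective(w):
    # Full empirical risk: 0.5 * mean squared residual.
    return 0.5 * np.mean((X @ w - y) ** 2)

budget = 100_000  # total individual gradient evaluations allowed
for b in [1, 10, 100, 1_000]:
    w = np.zeros(d)
    for _ in range(budget // b):  # bigger batches => fewer updates
        idx = rng.integers(0, n, size=b)              # sample a minibatch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / b   # minibatch gradient
        w -= 0.01 * grad                              # fixed step size
    print(f"b={b:>5}: objective after budget = {objective(w):.5f}")
```

With the same data budget, $b = 1$ ends up much closer to the optimum than $b = 1000$, which qualitatively matches the figure above.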
Formally, if $f$ is a strongly convex function with global optimum $w^*$, and $w_t$ is the output of SGD at the $t^{\text{th}}$ iteration, then you can prove that in expectation $$ \mathbb{E}[f(w_t) - f(w^*)] \leq \mathcal{O}\left(\frac{1}{\sqrt{t}}\right). $$ Note that this bound does not depend on the dataset size (this is relevant to your next question)! For minibatch SGD with batch size $b$, convergence to the optimum instead happens at a rate of $\mathcal{O}\left(\frac{1}{\sqrt{bt}} + \frac{1}{bt}\right)$ per update, so each update is better but costs $b$ times as much computation. When you have a very large amount of data, (Li et al., 2014) make the following point:
You have a trade-off between the synchronisation cost of small minibatches and the performance penalty of increased minibatch size. If you naively parallelise (minibatch) SGD, you pay some penalty, and your convergence rate slows as the data size increases.
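As a quick empirical check of the claim that the bound does not depend on $m$: the sketch below (again my own, for illustration) runs the same number of SGD updates on least-squares datasets of increasing size drawn from the same distribution, and the error after $t$ updates is essentially unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
d, t_max = 10, 20_000
w_true = rng.normal(size=d)

for m in [10**3, 10**4, 10**5]:
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)
    w = np.zeros(d)
    for t in range(1, t_max + 1):
        i = rng.integers(m)  # one uniformly sampled example per update
        w -= (0.1 / np.sqrt(t)) * X[i] * (X[i] @ w - y[i])  # ~1/sqrt(t) step size
    print(f"m={m:>7}: ||w - w_true|| = {np.linalg.norm(w - w_true):.4f}")
```

The suboptimality after a fixed number of updates is governed by $t$, not by $m$.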
Nonconvexity of the empirical optimisation problem
This is basically the point that @dopexxx has made. It's pretty well known that the optimisation problem you want to solve for deep neural nets in practice is not convex. In particular, the loss surface can have bad properties such as local minima, saddle points and plateaus.
In general, the surface you are trying to optimise over is thought to be a complicated manifold, and as far as I am aware all you can say about convergence is that you will converge to (a noise ball around) a stationary point. It seems reasonable that the greater the variety of real-world data, the more complicated this surface will be. Because the true gradient becomes more complicated as $m$ increases, you need more samples to approximate $\nabla f$ well with SGD.
"However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing m further will not extend the amount of training time needed to reach the model’s best possible test error."
(Goodfellow et al., 2016) state this a little more precisely in the discussion in section 8.3.1, where they point out that minibatch SGD follows the gradient of the true generalization error so long as no examples are repeated, so with enormous $m$ the first pass through the data is effectively optimising the test error directly.
(Bottou et al., 2017) offer a clear intuition for why extra data cannot slow SGD down: imagine the training set consists of many copies of a smaller base set. A batch method has to touch every copy to evaluate the gradient of the empirical risk $R_n$ (essentially the training loss), whereas SGD samples one example at a time and neither gains nor loses anything from the duplication. If you let the number of copies get arbitrarily large, then SGD will certainly converge before it has sampled every data point.
I think this agrees with my statistical intuition. You can achieve $\|w^* - w_k\| < \epsilon$ at iteration $k$ before you have necessarily sampled every point in your dataset, because your error tolerance $\epsilon$, not $m$, determines how much data you need to look at, i.e. how many SGD iterations you need to complete.
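Here is a small simulation of that intuition, in the spirit of the copies argument above (my own construction, not code from (Bottou et al., 2017)). The dataset has $m = 10^7$ examples but only 200 distinct ones, and SGD reaches a fixed tolerance $\epsilon$ after far fewer than $m$ updates:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_base = 5, 200
w_true = rng.normal(size=d)

# A small base set duplicated 50,000 times: m = 10^7 examples, no new information.
X = rng.normal(size=(n_base, d))
y = X @ w_true + 0.1 * rng.normal(size=n_base)
m = n_base * 50_000

# The empirical-risk minimiser of the duplicated set equals that of the base set.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

w, eps, t = np.zeros(d), 1e-2, 0
while np.linalg.norm(w - w_star) >= eps and t < m:  # stop early or after one pass
    i = rng.integers(m) % n_base  # a uniform draw from the duplicated set
    w -= (0.1 / np.sqrt(t + 1)) * X[i] * (X[i] @ w - y[i])
    t += 1
print(f"||w - w*|| < {eps} after t = {t:,} updates; dataset size m = {m:,}")
```

On a typical run $t$ comes out on the order of $10^4$, i.e. SGD converges to tolerance after touching far fewer than $m$ examples, exactly as the quoted passage claims.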
I found the first part of (Bottou et al., 2017) quite helpful for understanding SGD better.
References
Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. "Optimization methods for large-scale machine learning." arXiv preprint arXiv:1606.04838 (2016). https://arxiv.org/pdf/1606.04838.pdf
Li, Mu, et al. "Efficient mini-batch training for stochastic optimization." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014. https://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf