I am going through the following section of the book by Goodfellow et al. (2016), and I don't quite understand it.
Stochastic gradient descent has many important uses outside the
context of deep learning. It is the main way to train large linear
models on very large datasets. For a fixed model size, the cost per SGD
update does not depend on the training set size $m$. In practice, we often
use a larger model as the training set size increases, but we are not
forced to do so. The number of updates required to reach convergence
usually increases with training set size. However,
as $m$ approaches infinity, the model will eventually converge to its best
possible test error before SGD has sampled every example in the
training set. Increasing $m$ further will not extend the amount of training
time needed to reach the model’s best possible test error. From this
point of view, one can argue that the asymptotic cost of training a
model with SGD is $O(1)$ as a function of $m$.Section 5.9, p.150
- "The number of updates required to reach convergence usually increases with training set size". I can't get away around this one. In the normal gradient descent, it becomes computationally expensive to calculate the gradient at each step as the number of training examples increases. But I don't understand why the number of updates increases with the training size.
- "However, as $m$ approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing $m$ further will not extend the amount of training time needed to reach the model’s best possible test error." I don't understand this as well
Can you provide some intuition or arguments for the two cases above?
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Best Answer
In the first part they are talking about large-scale SGD convergence in practice, and in the second part about theoretical results on the convergence of SGD when the optimisation problem is convex.
"The number of updates required to reach convergence usually increases with training set size".
I found this statement confusing, but as @DeltaIV kindly pointed out in the comments, I think they are talking about practical considerations for a fixed model as the dataset size $m \rightarrow \infty$. I think there are two relevant phenomena:
Computational tradeoffs for distributed SGD
In a large-volume, high-rate data scenario, you might want to implement a distributed version of SGD (or, more likely, minibatch SGD). Unfortunately, making an efficient distributed version of SGD is difficult because you need to frequently share the parameter state $w$. In particular, you incur a large overhead cost for synchronising between machines, which incentivises you to use larger minibatches. The following figure from (Li et al., 2014) illustrates this nicely:
SGD here is minibatch SGD, and the other lines are algorithms they propose.
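To make the tradeoff concrete, here is a back-of-the-envelope sketch in Python. The cost constants are invented purely for illustration (they are not from (Li et al., 2014)); the point is only that a fixed per-update synchronisation overhead is amortised by larger minibatches:

```python
# Toy model of distributed minibatch SGD wall-clock cost. All constants are
# made-up assumptions for illustration, not measurements.
grad_cost_per_example = 1e-6  # seconds to compute one example's gradient (assumed)
sync_cost_per_update = 1e-2   # seconds to synchronise w across workers (assumed)
n_examples = 10**7            # total number of examples to process once

for batch_size in [1, 10, 100, 1_000, 10_000]:
    n_updates = n_examples // batch_size
    total = n_updates * (batch_size * grad_cost_per_example + sync_cost_per_update)
    print(f"b={batch_size:>6}: {n_updates:>8} updates, ~{total:9.1f}s wall clock")
```

With these (made-up) numbers, going from $b = 1$ to $b = 10{,}000$ takes the synchronisation overhead from dominating the runtime to being negligible.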
On the other hand, it is also known that for convex problems minibatches incur a computational cost which effectively slows convergence as the minibatch size increases. As you increase the minibatch size $m'$ towards the dataset size $m$, convergence becomes slower and slower until you are just doing ordinary batch gradient descent. The slowdown can be quite drastic for large minibatch sizes. Again, this is illustrated nicely by (Li et al., 2014):
Here they are plotting the minimum objective value they found on the KDD04 dataset after using $10^7$ datapoints.
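You can reproduce the qualitative effect on a toy convex problem. The sketch below (my own construction, not from the paper) runs minibatch SGD on a least-squares objective under a fixed budget of individual gradient evaluations, so larger batches get proportionally fewer updates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def objective(w):
    # Full empirical risk: 0.5 * mean squared residual.
    return 0.5 * np.mean((X @ w - y) ** 2)

budget = 100_000  # total individual gradient evaluations allowed
for b in [1, 10, 100, 1_000]:
    w = np.zeros(d)
    for _ in range(budget // b):  # bigger batches => fewer updates
        idx = rng.integers(0, n, size=b)              # sample a minibatch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / b   # minibatch gradient
        w -= 0.01 * grad                              # fixed step size
    print(f"b={b:>5}: objective after budget = {objective(w):.5f}")
```

With the same data budget, $b = 1$ ends up much closer to the optimum than $b = 1000$, which qualitatively matches the figure above.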
Formally, if $f$ is a strongly convex function with global optimum $w^*$, and $w_t$ is the output of SGD at the $t^{\text{th}}$ iteration, then you can prove that in expectation $$ \mathbb{E}[f(w_t) - f(w^*)] \leq \mathcal{O}\left(\frac{1}{\sqrt{t}}\right). $$ Note that this bound does not depend on the dataset size (this is relevant to your next question)! For minibatch SGD with batch size $b$, convergence to the optimum instead happens at a rate of $\mathcal{O}\left(\frac{1}{\sqrt{bt}} + \frac{1}{bt}\right)$ per update, so each update is better but costs $b$ times as much computation. When you have a very large amount of data, (Li et al., 2014) make the following point:
You have a trade-off between the synchronisation cost of small minibatches and the performance penalty of increased minibatch size. If you naively parallelise (minibatch) SGD, you pay some penalty, and your convergence rate slows as the data size increases.
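As a quick empirical check of the claim that the bound does not depend on $m$: the sketch below (again my own, for illustration) runs the same number of SGD updates on least-squares datasets of increasing size drawn from the same distribution, and the error after $t$ updates is essentially unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
d, t_max = 10, 20_000
w_true = rng.normal(size=d)

for m in [10**3, 10**4, 10**5]:
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)
    w = np.zeros(d)
    for t in range(1, t_max + 1):
        i = rng.integers(m)  # one uniformly sampled example per update
        w -= (0.1 / np.sqrt(t)) * X[i] * (X[i] @ w - y[i])  # ~1/sqrt(t) step size
    print(f"m={m:>7}: ||w - w_true|| = {np.linalg.norm(w - w_true):.4f}")
```

The suboptimality after a fixed number of updates is governed by $t$, not by $m$.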
Nonconvexity of the empirical optimisation problem
This is basically the point that @dopexxx has made. It's pretty well known that the optimisation problem you want to solve for deep neural nets in practice is not convex. In particular, the loss surface can have bad properties such as local minima, saddle points and plateaus.
In general, the surface you are trying to optimise over is thought to be a complicated manifold, and as far as I am aware all you can say about convergence is that you will converge to (a noise ball around) a stationary point. It seems reasonable that the greater the variety of real-world data, the more complicated this surface will be. Because the true gradient becomes more complicated as $m$ increases, you need more samples to approximate $\nabla f$ well with SGD.
"However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing m further will not extend the amount of training time needed to reach the model’s best possible test error."
(Goodfellow et al., 2016) state this a little more precisely in the discussion in section 8.3.1, where they point out that minibatch SGD follows the gradient of the true generalization error so long as no examples are repeated, so with enormous $m$ the first pass through the data is effectively optimising the test error directly.
(Bottou et al., 2017) offer a clear intuition for why extra data cannot slow SGD down: imagine the training set consists of many copies of a smaller base set. A batch method has to touch every copy to evaluate the gradient of the empirical risk $R_n$ (essentially the training loss), whereas SGD samples one example at a time and neither gains nor loses anything from the duplication. If you let the number of copies get arbitrarily large, then SGD will certainly converge before it has sampled every data point.
I think this agrees with my statistical intuition. You can achieve $\|w^* - w_k\| < \epsilon$ at iteration $k$ before you have necessarily sampled every point in your dataset, because your error tolerance $\epsilon$, not $m$, determines how much data you need to look at, i.e. how many SGD iterations you need to complete.
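Here is a small simulation of that intuition, in the spirit of the copies argument above (my own construction, not code from (Bottou et al., 2017)). The dataset has $m = 10^7$ examples but only 200 distinct ones, and SGD reaches a fixed tolerance $\epsilon$ after far fewer than $m$ updates:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_base = 5, 200
w_true = rng.normal(size=d)

# A small base set duplicated 50,000 times: m = 10^7 examples, no new information.
X = rng.normal(size=(n_base, d))
y = X @ w_true + 0.1 * rng.normal(size=n_base)
m = n_base * 50_000

# The empirical-risk minimiser of the duplicated set equals that of the base set.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

w, eps, t = np.zeros(d), 1e-2, 0
while np.linalg.norm(w - w_star) >= eps and t < m:  # stop early or after one pass
    i = rng.integers(m) % n_base  # a uniform draw from the duplicated set
    w -= (0.1 / np.sqrt(t + 1)) * X[i] * (X[i] @ w - y[i])
    t += 1
print(f"||w - w*|| < {eps} after t = {t:,} updates; dataset size m = {m:,}")
```

On a typical run $t$ comes out on the order of $10^4$, i.e. SGD converges to tolerance after touching far fewer than $m$ examples, exactly as the quoted passage claims.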
I found the first part of (Bottou et al., 2017) quite helpful for understanding SGD better.
References
Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. "Optimization methods for large-scale machine learning." arXiv preprint arXiv:1606.04838 (2016). https://arxiv.org/pdf/1606.04838.pdf
Li, Mu, et al. "Efficient mini-batch training for stochastic optimization." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014. https://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf