The applicability of batch or stochastic gradient descent really depends on the expected error manifold.
Batch gradient descent computes the gradient using the whole dataset. This is great for convex or relatively smooth error manifolds: we move fairly directly toward an optimum solution, either local or global. Additionally, given an annealed learning rate, batch gradient descent will eventually find the minimum located in its basin of attraction.
Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons explained a bit later. SGD works well (well, better than batch gradient descent, anyway) for error manifolds that have lots of local minima. In this case, the somewhat noisier gradient calculated from the reduced number of samples tends to jerk the model out of local minima into a region that is hopefully more optimal. Single samples are really noisy, while minibatches average a little of the noise out, so the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch size is small enough to avoid some of the poor local minima, but large enough that it doesn't avoid the global minimum or better-performing local minima. (Incidentally, this assumes that the best minima have larger and deeper basins of attraction, and are therefore easier to fall into.)
One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable.
Usually, this computational advantage is leveraged by performing many more iterations of SGD, taking many more steps than conventional batch gradient descent. The result is usually a model very close to the one batch gradient descent would find, or better.
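As a concrete illustration, here is a minimal minibatch-SGD loop in Python/NumPy (the linear-regression data, loss, and hyperparameters are all invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data, standing in for a dataset too large to
# process in one fully vectorized pass.
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=10_000)

def gradient(w, X_batch, y_batch):
    # Gradient of the mean squared error over the batch.
    return 2 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

w = np.zeros(5)
lr, batch_size = 0.1, 32
for epoch in range(5):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = order[start:start + batch_size]
        # Only this small batch needs to be in memory for the update.
        w -= lr * gradient(w, X[batch], y[batch])
```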
The way I like to think of how SGD works is to imagine that I have one point that represents my input distribution. My model is attempting to learn that input distribution. Surrounding the input distribution is a shaded area that represents the input distributions of all of the possible minibatches I could sample. It's usually a fair assumption that the minibatch input distributions are close to the true input distribution. Batch gradient descent, at every step, takes the steepest route to reach the true input distribution. SGD, on the other hand, chooses a random point within the shaded area and takes the steepest route toward that point, choosing a new point at each iteration. The average of all of these steps will approximate the true input distribution, usually quite well.
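To put a number on that picture, the following sketch (same kind of toy setup as above; all names and values are illustrative) compares the full-batch gradient with the spread of minibatch gradients around it:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=10_000)
w = np.zeros(5)

def gradient(w, X_batch, y_batch):
    return 2 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

full_grad = gradient(w, X, y)  # the "true" direction

# Each minibatch gradient is a random point in the shaded area.
samples = []
for _ in range(1000):
    b = rng.choice(len(y), size=32, replace=False)
    samples.append(gradient(w, X[b], y[b]))
samples = np.array(samples)

print(np.linalg.norm(samples.mean(axis=0) - full_grad))  # small: unbiased on average
print(samples.std(axis=0))  # per-coordinate noise, i.e. the "jerk"
```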
The original Adam paper briefly explains what it means by "invariant to diagonal rescaling of the gradients" at the end of section 2.1.
I'll try to explain it in more detail.
Like stochastic gradient descent (SGD), Adam is an iterative method that uses gradients in order to find a minimum of a function.
(By "gradients" I mean "the values of the gradient in different locations in parameter space". I later use "partial derivatives" in a similar fashion.)
But in contrast to SGD, Adam doesn't really use the gradient as a whole. Instead, Adam treats the partial derivative of each parameter independently.
(By "partial derivative of a parameter $x$" I mean "partial derivative of the cost function $C$ with respect to $x$", i.e. $\frac{\partial C}{\partial x}$.)
Let $\Delta^{(t)}$ be the step that Adam takes in parameter space in the $t^{\text{th}}$ iteration. Then the step it takes in the dimension of the $j^{\text{th}}$ parameter (in the $t^{\text{th}}$ iteration) is $\Delta^{(t)}_j$, which is given by:
$$\Delta^{(t)}_j=-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot {\hat m}^{(t)}_j$$
where:
- $\alpha$ is the learning rate hyperparameter.
- $\epsilon$ is a small hyperparameter to prevent division by zero.
- ${\hat m}^{(t)}_j$ is a bias-corrected exponential moving average of the partial derivatives of the $j^{\text{th}}$ parameter calculated in iterations $1$ to $t$.
- ${\hat v}^{(t)}_j$ is a bias-corrected exponential moving average of the squares of the partial derivatives of the $j^{\text{th}}$ parameter calculated in iterations $1$ to $t$.
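To make the update concrete, here is a minimal sketch in Python/NumPy (the function name is mine; the default values are the ones suggested in the paper, and the hatted quantities come from the paper's bias correction):

```python
import numpy as np

def adam_step(m, v, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam iteration; every operation is elementwise, i.e. per parameter."""
    m = beta1 * m + (1 - beta1) * g      # moving average of partial derivatives
    v = beta2 * v + (1 - beta2) * g**2   # moving average of their squares
    m_hat = m / (1 - beta1**t)           # bias-corrected m-hat
    v_hat = v / (1 - beta2**t)           # bias-corrected v-hat
    delta = -alpha * m_hat / (np.sqrt(v_hat) + eps)  # the step Delta^(t)
    return m, v, delta
```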
Now, what happens when we scale the partial derivative of the $j^{\text{th}}$ parameter by a positive factor $c$?
(I.e. the partial derivative of the $j^{\text{th}}$ parameter is just a function whose domain is the parameter space, so we can simply multiply its value by $c$.)
- ${\hat m}^{(t)}_j$ becomes $c\cdot{\hat m}^{(t)}_j$
- ${\hat v}^{(t)}_j$ becomes $c^2\cdot{\hat v}^{(t)}_j$
- Thus (using the fact that $c>0$), we get that $\Delta^{(t)}_j$ becomes:
$$-\frac{\alpha}{\sqrt{c^2\cdot{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j=-\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j$$
And assuming $\epsilon$ is very small, we get:
$$\begin{gathered}
-\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j\approx -\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}}\cdot c\cdot{\hat m}^{(t)}_j=\\
-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}}\cdot{\hat m}^{(t)}_j\approx-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot{\hat m}^{(t)}_j
\end{gathered}$$
I.e., scaling the partial derivative of the $j^{\text{th}}$ parameter by a positive factor $c$ essentially doesn't affect $\Delta^{(t)}_j$.
Finally, let $g=\begin{pmatrix}g_{1}\\ g_{2}\\ \vdots\end{pmatrix}$ be the gradient. Then $g_j$ is the partial derivative of the $j^{\text{th}}$ parameter.
What happens when we multiply the gradient by a diagonal matrix with only positive elements?
$$\begin{pmatrix}c_{1}\\ & c_{2}\\ & & \ddots\end{pmatrix}g=\begin{pmatrix}c_{1}\\ & c_{2}\\ & & \ddots\end{pmatrix}\begin{pmatrix}g_{1}\\ g_{2}\\ \vdots\end{pmatrix}=\begin{pmatrix}c_{1}\cdot g_{1}\\ c_{2}\cdot g_{2}\\ \vdots\end{pmatrix}$$
So it would only scale each partial derivative by a positive factor, but as we have seen above, this won't affect the steps that Adam takes.
In other words, Adam is invariant to multiplying the gradient by a diagonal matrix with only positive factors, which is what the paper means by "invariant to diagonal rescaling of the gradients".
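This invariance is easy to check numerically. The sketch below (reusing the `adam_step` function from the earlier sketch; the gradients and scaling factors are made up) runs Adam on the same sequence of gradients with and without a diagonal rescaling:

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 3))       # made-up gradients for 100 iterations
scales = np.array([0.01, 1.0, 1000.0])  # diagonal entries c_1, c_2, c_3 > 0

for rescale in (np.ones(3), scales):
    m = v = np.zeros(3)
    for t, g in enumerate(grads, start=1):
        m, v, delta = adam_step(m, v, rescale * g, t)
    print(delta)  # the final step is (almost) identical in both runs
```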
Regarding the quote from the paper Normalized Direction-preserving Adam: it describes the "ill-conditioning problem". (This is the paper's name for the problem. Note that it is a different problem from that of an ill-conditioned Hessian.)
It seems to me that this problem is unrelated to Adam (and unrelated to the fact that Adam is invariant to rescaling of the gradient). I deduced that mostly from two other quotes in the paper that elaborate on the ill-conditioning problem:
- "Furthermore, even when batch normalization is not used, a network using linear rectifiers (e.g., ReLU, leaky ReLU) as activation functions, is still subject to ill-conditioning of the parameterization (Glorot et al., 2011), and hence the same problem. We refer to this problem as the ill-conditioning problem."
The quote refers to the paper Deep Sparse Rectifier Neural Networks, which never mentions Adam, and also describes the problem of "ill-conditioning of the parametrization", which seems to me very similar (if not identical) to the "ill-conditioning problem".
- "The ill-conditioning problem occurs when the magnitude change of an input weight vector can be compensated by other parameters, such as the scaling factor of batch normalization, or the output weight vector, without affecting the overall network function. Consequently, suppose we have two DNNs that parameterize the same function, but with some of the input weight vectors having different magnitudes, applying the same SGD or Adam update rule will, in general, change the network functions in different ways. Thus, the ill-conditioning problem makes the training process inconsistent and difficult to control."
If I understand correctly, this quote says that both SGD and Adam suffer from the ill-conditioning problem. I.e. the problem isn't unique to Adam.
The Adam paper says, "...many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions..." Here, they just mean that the objective function is a sum of errors over training examples, and training can be done on individual examples or minibatches. This is the same as in stochastic gradient descent (SGD), which is more efficient for large scale problems than batch training because parameter updates are more frequent.
As for why Adam works, it uses a few tricks.
One of these tricks is momentum, which can give faster convergence. Imagine an objective function that's shaped like a long, narrow canyon that gradually slopes toward a minimum. Say we want to minimize this function using gradient descent. If we start from some point on the canyon wall, the negative gradient will point in the direction of steepest descent, i.e. mostly toward the canyon floor, because the canyon walls are much steeper than the gradual slope of the canyon toward the minimum. If the learning rate (i.e. step size) is small, we could descend to the canyon floor, then follow it toward the minimum, but progress would be slow. We could increase the learning rate, but this wouldn't change the direction of the steps. In this case, we'd overshoot the canyon floor and end up on the opposite wall. We would then repeat this pattern, oscillating from wall to wall while making slow progress toward the minimum. Momentum can help in this situation.
Momentum simply means that some fraction of the previous update is added to the current update, so that repeated updates in a particular direction compound; we build up momentum, moving faster and faster in that direction. In the case of the canyon, we'd build up momentum in the direction of the minimum, since all updates have a component in that direction. In contrast, moving back and forth across the canyon walls involves constantly reversing direction, so momentum would help to damp the oscillations in those directions.
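In update form, classical momentum looks roughly like this (a generic sketch; the function name and the 0.9 coefficient are conventional choices, not taken from a specific source):

```python
def momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    # A fraction `mu` of the previous update carries over, so updates that
    # keep pointing the same way compound, while components that keep
    # reversing direction (the wall-to-wall bouncing) largely cancel.
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity
```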
Another trick that Adam uses is to adaptively select a separate learning rate for each parameter. Parameters that would ordinarily receive smaller or less frequent updates receive larger updates with Adam (the reverse is also true). This speeds learning in cases where the appropriate learning rates vary across parameters. For example, in deep networks, gradients can become small at early layers, and it makes sense to increase learning rates for the corresponding parameters. Another benefit of this approach is that, because learning rates are adjusted automatically, manual tuning becomes less important. Standard SGD requires careful tuning (and possibly online adjustment) of learning rates, but this is less true with Adam and related methods. It's still necessary to select hyperparameters, but performance is less sensitive to them than to SGD learning rates.
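The adaptive part works along the lines of RMSprop, sketched below with conventional defaults (this shows the general idea rather than Adam's exact rule): each parameter's step is divided by a running average of its squared gradients, so parameters with persistently small gradients get relatively larger steps.

```python
import numpy as np

def rmsprop_step(w, grad, sq_avg, lr=0.001, decay=0.9, eps=1e-8):
    # Per-parameter running average of squared gradients.
    sq_avg = decay * sq_avg + (1 - decay) * grad**2
    # Dividing by its square root gives each parameter its own effective
    # learning rate: small past gradients -> larger steps, and vice versa.
    return w - lr * grad / (np.sqrt(sq_avg) + eps), sq_avg
```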
Related methods:
Momentum is often used with standard SGD. An improved version is called Nesterov momentum or Nesterov accelerated gradient. Other methods that use automatically tuned learning rates for each parameter include: Adagrad, RMSprop, and Adadelta. RMSprop and Adadelta solve a problem with Adagrad that could cause learning to stop. Adam is similar to RMSprop with momentum. Nadam modifies Adam to use Nesterov momentum instead of classical momentum.
References:

- Kingma and Ba (2014). Adam: A Method for Stochastic Optimization.
- Goodfellow et al. (2016). Deep Learning, chapter 8.
- Slides from Geoff Hinton's course.
- Dozat (2016). Incorporating Nesterov Momentum into Adam.