The Adam paper says, "...many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions..." Here, they just mean that the objective function is a sum of errors over training examples, and training can be done on individual examples or minibatches. This is the same as in stochastic gradient descent (SGD), which is more efficient for large scale problems than batch training because parameter updates are more frequent.
As for why Adam works, it uses a few tricks.
One of these tricks is momentum, which can give faster convergence. Imagine an objective function that's shaped like a long, narrow canyon that gradually slopes toward a minimum. Say we want to minimize this function using gradient descent. If we start from some point on the canyon wall, the negative gradient will point in the direction of steepest descent, i.e. mostly toward the canyon floor. This is because the canyon walls are much steeper than the gradual slope of the canyon toward the minimum. If the learning rate (i.e. step size) is small, we could descend to the canyon floor, then follow it toward the minimum. But, progress would be slow. We could increase the learning rate, but this wouldn't change the direction of the steps. In this case, we'd overshoot the canyon floor and end up on the opposite wall. We would then repeat this pattern, oscillating from wall to wall while making slow progress toward the minimum. Momentum can help in this situation.
Momentum simply means that some fraction of the previous update is added to the current update, so that repeated updates in a particular direction compound; we build up momentum, moving faster and faster in that direction. In the case of the canyon, we'd build up momentum in the direction of the minimum, since all updates have a component in that direction. In contrast, moving back and forth across the canyon walls involves constantly reversing direction, so momentum would help to damp the oscillations in those directions.
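To make this concrete, here is a minimal sketch of the classical momentum update on a made-up "canyon" quadratic (the function, step size, and momentum coefficient below are chosen only for illustration):

```python
import numpy as np

def grad(p):
    # f(x, y) = 0.5 * (100 * x**2 + y**2): steep canyon walls in x,
    # a gentle slope in y toward the minimum at the origin.
    return np.array([100.0 * p[0], p[1]])

def descend(lr, beta, steps=200):
    p = np.array([1.0, 10.0])                  # start partway up a canyon wall
    update = np.zeros(2)
    for _ in range(steps):
        update = beta * update - lr * grad(p)  # beta = 0 gives plain gradient descent
        p = p + update
    return p

print(descend(lr=0.019, beta=0.0))   # bounces between the walls, creeps slowly along the floor
print(descend(lr=0.019, beta=0.9))   # oscillations are damped; progress along the floor compounds
```

Adam applies momentum via an exponential moving average of the gradient rather than this classical formulation, but the intuition is the same.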
Another trick that Adam uses is to adaptively select a separate learning rate for each parameter. Parameters that would ordinarily receive smaller or less frequent updates receive larger updates with Adam (and the reverse is also true). This speeds learning in cases where the appropriate learning rates vary across parameters. For example, in deep networks, gradients can become small at early layers, and it makes sense to increase learning rates for the corresponding parameters. Another benefit of this approach is that, because learning rates are adjusted automatically, manual tuning becomes less important. Standard SGD requires careful tuning (and possibly online adjustment) of learning rates, but this is less true with Adam and related methods. It's still necessary to select hyperparameters, but performance is less sensitive to them than to SGD learning rates.
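For concreteness, here is a minimal sketch of the resulting update rule, following Algorithm 1 of Kingma and Ba (2014); the division by the square root of the second-moment estimate is what gives each parameter its own effective step size:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient g at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * g        # first moment: moving average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g**2     # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1**t)             # correct the bias from initializing m at zero
    v_hat = v / (1 - beta2**t)             # correct the bias from initializing v at zero
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # small v_hat -> larger effective step
    return theta, m, v
```

A parameter whose gradients have been consistently small ends up with a small `v_hat` and therefore takes relatively large steps, which is exactly the adaptive behavior described above.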
Related methods:
Momentum is often used with standard SGD. An improved version is called Nesterov momentum or Nesterov accelerated gradient. Other methods that use automatically tuned learning rates for each parameter include: Adagrad, RMSprop, and Adadelta. RMSprop and Adadelta solve a problem with Adagrad that could cause learning to stop. Adam is similar to RMSprop with momentum. Nadam modifies Adam to use Nesterov momentum instead of classical momentum.
References:
Kingma and Ba (2014). Adam: A Method for Stochastic Optimization.
Goodfellow et al. (2016). Deep Learning, Chapter 8.
Slides from Geoff Hinton's course
Dozat (2016). Incorporating Nesterov Momentum into Adam.
A (real) symmetric matrix has a complete set of orthogonal eigenvectors, and the corresponding eigenvalues are all real numbers. For non-symmetric matrices this can fail. For example, a rotation of two-dimensional space (by anything other than a half or full turn) has no eigenvectors or eigenvalues over the real numbers; you must pass to a vector space over the complex numbers to find them.
If the matrix is additionally positive definite, then these eigenvalues are all positive real numbers. This fact is much easier than the first, for if $v$ is an eigenvector with unit length, and $\lambda$ the corresponding eigenvalue, then
$$ \lambda = \lambda v^\top v = v^\top A v > 0 $$
where the second equality uses $Av = \lambda v$ and the final inequality uses the definition of positive definiteness.
The importance here for intuition is that the eigenvectors and eigenvalues of a linear transformation describe the coordinate system in which the transformation is most easily understood. A linear transformation can be very difficult to understand in a "natural" basis like the standard coordinate system, but each comes with a "preferred" basis of eigenvectors in which the transformation acts as a scaling in all directions. This makes the geometry of the transformation much easier to understand.
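A quick numerical illustration of both facts, and of the "preferred basis" point; the matrix here is just a random symmetric positive definite example:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                 # symmetric positive definite by construction

eigvals, Q = np.linalg.eigh(A)          # eigh is the routine for symmetric matrices
print(eigvals)                          # all real and strictly positive
print(np.allclose(Q.T @ Q, np.eye(4)))  # the eigenvectors are orthonormal
print(np.allclose(Q.T @ A @ Q, np.diag(eigvals)))  # in the eigenbasis, A is just a scaling
```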
For example, the second derivative test for the local extrema of a function $R^2 \rightarrow R$ is often given as a series of mysterious conditions involving an entry in the second derivative matrix and some determinants. In fact, these conditions simply encode the following geometric observation:
- If the matrix of second derivatives is positive definite, you're at a local minimum.
- If the matrix of second derivatives is negative definite, you're at a local maximum.
- If the matrix of second derivatives is indefinite, you're at neither: a saddle point.
You can understand this via the geometric reasoning above, working in an eigenbasis. The first derivative at a critical point vanishes, so the rates of change of the function there are controlled by the second derivative. Now we can reason geometrically:
- In the first case there are two eigen-directions, and if you move along either the function increases.
- In the second, two eigen-directions, and if you move in either the function decreases.
- In the last, there are two eigen-directions, but in one of them the function increases, and in the other it decreases.
Since the eigenvectors span the whole space, any other direction is a linear combination of eigen-directions, so the rates of change in those directions are linear combinations of the rates of change in the eigen-directions. So in fact, this holds in all directions (this is more or less what it means for a function defined on a higher-dimensional space to be differentiable). Now if you draw a little picture in your head, this makes a lot of sense out of something that is quite mysterious in beginner calculus texts.
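Here is the same test written directly in terms of the eigenvalues of the Hessian; the example function at the end is just an illustration:

```python
import numpy as np

def classify_critical_point(hessian):
    """Classify a critical point from the eigenvalues of the (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)
    if np.all(eigvals > 0):
        return "local minimum"   # the function increases along every eigen-direction
    if np.all(eigvals < 0):
        return "local maximum"   # the function decreases along every eigen-direction
    if eigvals.min() < 0 < eigvals.max():
        return "saddle point"    # increases in some eigen-directions, decreases in others
    return "inconclusive"        # a zero eigenvalue: the second derivative test says nothing

# f(x, y) = x**2 - y**2 has a critical point at the origin with Hessian [[2, 0], [0, -2]].
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle point
```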
This applies directly to one of your bullet points:
> The quadratic form $\frac{1}{2} x^\top A x - b^\top x + c$ is convex if $A$ is SPD. Convexity is a nice property that ensures the local solution is the global solution.
The matrix of second derivatives is $A$ everywhere, which is symmetric positive definite. Geometrically, this means that if we move away in any eigen-direction (and hence in any direction, because any other direction is a linear combination of eigen-directions), the function bends upward, away from its tangent plane. This means the whole surface is convex.
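A small numerical check with a made-up SPD matrix: the critical point of the quadratic form, where the gradient $Ax - b$ vanishes, is its global minimum:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # symmetric with positive eigenvalues, so SPD
b = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ A @ x - b @ x    # the constant c is dropped; it doesn't move the minimum

x_star = np.linalg.solve(A, b)           # the gradient A x - b vanishes here
rng = np.random.default_rng(0)
for _ in range(5):
    x = x_star + rng.standard_normal(2)  # step away in a random direction
    print(f(x) >= f(x_star))             # True every time: the surface bends up in all directions
```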
Best Answer
If your equations are linear, as suggested by the link you provide, you need to do linear regression to solve all the equations simultaneously:
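A minimal sketch of what that looks like for an overdetermined linear system, using ordinary least squares (the design matrix `X` and targets `y` below are made-up illustrative numbers):

```python
import numpy as np

# More equations (rows) than unknowns, so we fit all of them simultaneously
# in the least-squares sense instead of solving them one at a time.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])               # each row is one equation: intercept term plus one feature
y = np.array([2.1, 2.9, 4.2, 4.8])

w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)                                 # coefficients minimizing ||X w - y||^2
```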