- You're assuming all data are sampled from the same distribution, so on average the gradient computed at any step will be roughly equivalent to the global gradient. SGD is just gradient descent except that, rather than computing the cost on the whole set, you compute it on a random subset. This is for speed. But the crucial point is that you're randomly selecting from a set whose distribution matches that of the entire set.
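As a toy illustration of this point (not from the original post — the data and numbers are made up), a minibatch gradient is a noisy but, on average, faithful estimate of the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D least-squares problem: cost(w) = mean((x*w - y)^2).
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)

def gradient(w, idx):
    """Gradient of the mean squared error over the examples in idx."""
    xi, yi = x[idx], y[idx]
    return np.mean(2 * xi * (xi * w - yi))

w = 0.0
full = gradient(w, np.arange(len(x)))              # batch gradient (BGD)
mini = gradient(w, rng.choice(len(x), size=32))    # minibatch estimate (SGD)

# Because the batch is drawn from the same data, the minibatch gradient
# is a noisy but (on average) faithful estimate of the full one.
print(full, mini)
```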
It can't escape the minima for the same reason that regular GD can't - the gradient always points "downhill", and if you're in a local minimum then, by definition, every direction is "uphill".
In general we don't know that there isn't a better (or worse) minimum. Indeed, many methods use lots of random initializations to try to mitigate problems from bad initial guesses (look up simulated annealing).
There is no definitive solution that works in every situation for avoiding local minima. Look up "gradient descent with momentum" - this is one method that adds a "momentum" term (inspired by physics) to help it stay robust against local minima, but if they are deep enough or "wide" enough, it will not matter.
This is why people try so hard to cast problems in a way where convex optimization can be used
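A minimal sketch of the momentum idea mentioned above, on a made-up double-well cost (the function, learning rate, and momentum coefficient are illustrative assumptions, not anything from the original discussion): the velocity term accumulates past gradients, so the iterate can coast through a shallow well that plain GD settles into.

```python
def grad(x):
    """Derivative of f(x) = x**4 - 2*x**2 + 0.3*x: a double well whose left
    minimum (near x = -1.04) is deeper than the right one (near x = 0.96)."""
    return 4 * x**3 - 4 * x + 0.3

def gd(x=2.0, lr=0.01, steps=500):
    """Plain gradient descent."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def gd_momentum(x=2.0, lr=0.01, beta=0.9, steps=500):
    """Gradient descent with a momentum ("velocity") term."""
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)   # velocity accumulates past gradients
        x += v
    return x

plain = gd()          # settles into the nearer, shallower well
mom = gd_momentum()   # carries enough "speed" to coast into the deeper well
```

With these particular (hand-picked) settings, plain GD stops near x = 0.96 while the momentum run ends near x = -1.04; a deep or wide enough barrier would defeat the momentum run too, as noted above.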
Both BGD and SGD look for the nearest minimum. SGD is just a "low resolution" version of BGD.
We are not certain. I'm sure someone could do an analysis of the likelihood of leaving a local minimum of a certain description given the distribution the sample is pulled from, but I've not done this nor seen it done. It's not very important, because in general we don't have a full description of the cost function anyway. Remember, we are sampling the cost function at a point - in solving a real problem you would never have a graph like you do (otherwise, what's the point of SGD - just look at the graph).
This is a little weirdly phrased, but I think you're asking why stochastic gradient descent is more robust to small local minima. It stems from the stochasticity introduced by sampling only a small portion of the data to get your cost function. If you just happen to subsample points whose gradient pushes you out of a "true" local minimum, then great. You would never know that you had been saved from a "true" local minimum, though.
Again, SGD was not invented to be robust against local minima. It was a simple extension of regular GD for the sake of speed and incremental updates in e.g. neural networks. If you want robustness against local minima, any form of regular GD is not what you're looking for. Again, think about adding a momentum term or checking out some other optimizers
Edit 1:
First, thanks to @Hyperplane for pointing out that SGD is not just used for speed, but also regularization. Check out the paper he linked below in the comments if interested.
Response to edits in the original post:
Since we randomly select each batch ... the whole dataset
This is essentially correct. We make the probabilistic assumption that all data are sampled from the same distribution. The random sample will exhibit different statistical attributes than the whole set (e.g. sample mean), but on average we expect that the gradient of the cost function computed from the subsample will be similar to that of the entire sample.
Note that in some cases in your GIF it is actually quite far off - the expected difference shrinks as your subsample grows, and determining how big to make your subsample is a consideration you have to make.
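To make the "expected difference shrinks as your subsample grows" point concrete, here is a small sketch (the population of per-example gradients is simulated with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up population of per-example gradients; its mean plays the role
# of the full-batch gradient.
grads = rng.normal(loc=2.0, scale=5.0, size=100_000)
full = grads.mean()

def minibatch_error(batch_size, trials=2000):
    """Average |minibatch gradient - full gradient| over many random batches."""
    idx = rng.integers(0, len(grads), size=(trials, batch_size))
    return np.abs(grads[idx].mean(axis=1) - full).mean()

# The expected error shrinks roughly like 1/sqrt(batch_size).
print(minibatch_error(10), minibatch_error(1000))
```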
However there is still a thing that confuses me ... pushed again towards a shallower minimum
There are a few things to say here.
I'm not sure if I can say with mathematical certainty that SGD is more likely to reach a better minimum - I have no proof of this. We can see intuitively that there could be reason for this, so it is "worth" trying in practice.
The step size relative to the "size" of the minimum matters. Let's assume for the sake of simplicity that our minimum sits at the bottom of a surface which describes the bottom half of a sphere. If the step size is 10 times the radius of the sphere, we simply cannot get caught in this minimum (unless of course we happen to land exactly on the minimum).
On the other hand, if the radius of the sphere is $10^6$ times greater than the step size, it is virtually impossible to leave this "bowl".
To directly answer your question about progressively "jumping up" to higher and higher minima, the main answer is the importance of tuning your learning rate (step size). Having a sufficiently small learning rate will make this virtually impossible.
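A sketch of the step-size-versus-bowl-size point above, using $f(x) = x^2$ as a stand-in bowl (the learning rates are illustrative values only):

```python
def descend(lr, x=0.9, steps=60):
    """Plain GD on f(x) = x**2, a "bowl" of unit scale."""
    for _ in range(steps):
        x -= lr * 2 * x      # derivative of x**2 is 2*x
    return x

tiny_steps = descend(lr=0.1)   # |x| shrinks by a factor 0.8 per step: trapped
huge_steps = descend(lr=1.1)   # |x| grows by a factor 1.2 per step: thrown out
```

The only thing that changed between the two runs is the ratio of step size to bowl scale, which is exactly the point about tuning the learning rate.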
Question 1 answer: Again, you're highlighting the importance of learning rate. With a sufficiently small learning rate, it is very unlikely that you would end up in a much higher minimum. If the learning rate is very high, it could happen or you could essentially just end up bouncing around randomly. If the learning rate is "medium" then you could end up in the global minimum or perhaps either of the first two to three "up the bowl" on either side of the global minimum. Depends on learning rate, small batch size, and the stochasticity introduced from random selection.
Despite the fact that ... In both these cases SGD can fail
I was actually going to answer something like this above, but I thought it was confusing. Either way, here it is: the steepness of the minimum makes this very unlikely given a reasonable step size. Let's say for the sake of example that the gradient is -20 on the left and +20 on the right of the minimum. Remember that we expect our small batches to have similar gradients (derivatives) to the whole dataset. So we might expect a gradient range of, say, 15 to 25 in magnitude (negative on the left, positive on the right). We are still pointing in the right direction. Could you "hop out"? Sure. But it's very unlikely and would be a "fluke" of a random sample.
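A rough sketch of this sign argument (the per-example gradients are modeled as the true gradient plus Gaussian noise, which is an illustrative assumption, not anything derived from the original post):

```python
import numpy as np

rng = np.random.default_rng(2)

def sign_flip_rate(true_grad, noise=5.0, batch=32, trials=10_000):
    """Fraction of minibatches whose mean gradient has the wrong sign, when
    per-example gradients are modeled as true_grad plus Gaussian noise."""
    samples = true_grad + rng.normal(scale=noise, size=(trials, batch))
    return np.mean(np.sign(samples.mean(axis=1)) != np.sign(true_grad))

# Steep wall (|gradient| = 20): essentially never points the wrong way.
# Shallow wall (|gradient| = 0.5): the sign flips quite often.
print(sign_flip_rate(20.0), sign_flip_rate(0.5))
```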
Question 2 answer: Depends on learning rate and initialization. If the learning rate is pretty low, then you could very well get stuck in any of the minima near x = -0.5, 0, or 2. This isn't really a bad thing in this particular case for either the 0 or 2 minima.
Question 3 answer: Again, "wide" is hard to define when you can have a function that is at any scale. Are we concerned with width-to-depth ratio (rhetorical)? In short the answer is no - if the learning rate is small enough then either will work. I can expand if you need.
Question 4 answer: I'm not sure what you're saying here. Guess the size of what - the "size" of the "bowls" around the minima? Recall that you can get much more complex behavior in higher dimensions (e.g. saddle points in 2D), and higher dimensions are where most applications lie. In general, you don't know the "scale" of the relevant features of the function.
Take the 2nd function you added. It exists in the range -5 to 5, with important features on that order. You could've just as easily made this $-5\times10^{-10}$ to $5\times10^{-10}$, or $-5\times10^{10}$ to $5\times10^{10}$. A learning rate of, say, 0.01 could be useful for your function, but would be hopelessly useless for the other two scales (even impossible for the first).
Question 5 answer: As above, depth is important because even if the small batch gradients are not equal to the whole batch gradient, they will still generally be pointing in the right direction. The smaller the magnitude of the gradient, the easier it is for random sampling to go in a "bad" direction.
On the other hand, even though it is easier to go in a "bad" direction in a wider bowl, you have to take many "bad" steps in succession to leave the bowl. So there is a tradeoff, and we have seen again the importance of tuning your learning rate (and small batch size).
Look again at your GIF - around the big hills, the small-batch derivatives pretty closely match - at least they have the right direction. However, on the far left of your function, the derivatives don't match as closely.
Question 6 answer: You assume nothing beyond the assumptions of regular GD (namely continuity and differentiability). In general, SGD is used because it is faster and, as mentioned by @Hyperplane, is a form of regularization. It may leave local minima (but so too might regular GD with a high enough learning rate). I cannot give you a definitive rule for when it might be better to use one or the other. I'm not saying such a rule doesn't exist, but if one does, I don't know it.
Edits 2:
Confusion 1 response: It depends what the source of the escape is. If you are asking whether or not the stochasticity introduced from using a minibatch is less likely to be the cause of leaving a deep minimum, then yes. The deeper the minimum is (given a fixed width), the higher the magnitude of the derivative and the less likely a minibatch will have a derivative that points in the "wrong" direction (i.e. has a different sign than the true whole batch derivative, in the 1D case).
If the source of the escape is that your learning rate is not tuned properly, then having a deeper minimum could actually make you more likely to leave. Remember, you're not really stepping "downhill" - you're taking a step left or right (in the 1D case) whose size depends on your learning rate and the derivative at that point. If the derivative has higher magnitude (i.e. the minimum is deeper), then it means you are taking a bigger step to the right or left. If the learning rate is high enough, you could very well take a step big enough to leave the well.
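A sketch of this point: holding the width fixed and scaling only the depth of a quadratic bowl, the same learning rate converges in the shallow case and is thrown out in the deep case (the numbers are illustrative, chosen to make the contrast obvious):

```python
def final_x(depth, lr=0.3, x=0.5, steps=40):
    """Plain GD on f(x) = depth * x**2: same width, deeper as depth grows."""
    for _ in range(steps):
        x -= lr * 2 * depth * x    # derivative is 2 * depth * x
    return x

shallow = final_x(depth=1.0)   # multiplier |1 - 0.6| = 0.4 per step: converges
deep = final_x(depth=4.0)      # multiplier |1 - 2.4| = 1.4 per step: escapes
```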
the smaller the batch size and the bigger the learning rate, the deeper minima the red point is looking for
This is not true for the reasons given in my last paragraph. Higher learning rate does not mean you are walking "downhill" with bigger steps, it means you are walking to the left or right with bigger steps and then "ending up on the function" at whatever altitude happens to correspond to that x value.
Imagine this: I have a global minimum at (0, 0), and a global maximum at (1, 1000). Let's say I'm at x = -0.5, and the cost function happens to have a derivative of -1 there (i.e. "pointing" towards the minimum). Let's also assume that my learning rate is 1.5. Then my update will take my current position and add $-\text{derivative} \times \text{learning rate} = -(-1) \times 1.5 = +1.5$. Therefore, my new x becomes $x_{new} = x_{old} + 1.5 = -0.5 + 1.5 = 1$. Look at that! I've ended up increasing my cost from 0 to 1000; and, worse yet, the derivative of the cost function at my new x value is 0, so I am stuck here.
This is obviously a very convenient and unlikely scenario, but it highlights the very important point that steps are taken on the x-axis, not on the "function surface". It is just to point out that you can leave minima if they are very skinny and deep if your learning rate is on the order of their width.
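The arithmetic of the scenario above, written out as one gradient-descent update:

```python
x_old = -0.5
derivative = -1.0     # slope of the cost at x_old, "pointing" at the minimum
learning_rate = 1.5

x_new = x_old - learning_rate * derivative
print(x_new)  # -0.5 - 1.5*(-1) = 1.0: we land on the global maximum
```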
Confusion 2 response: We have been using the term "minimum" relatively loosely, and while it is convenient, there is a problem here. You have said
Now, when the red point gets into the minimum C, then each of 100 observations has 0 error and therefore 0 gradient. So, whatever batch you choose it's going to have 0 gradient, since its gradient is the sum of gradients of the observations the batch consists of. Consequently, it's impossible for the red point to escape from the minimum C.
We have been talking about being "in" a minimum as being in the "bowl" that surrounds it. The minimum itself is only the single point at which the derivative is 0. If you happen to land on a minimum exactly, then it is true you will never be able to leave. All updates will move you exactly 0. This is very unlikely. With the right learning rate, you will asymptotically converge to a minimum, but not reach it. Even if you happen to land on one (overwhelmingly unlikely for functions like those you show - the set of minima is measure zero), then you could still be pushed off of it just because of the errors introduced from floating-point computations.
I think the above might've answered your number 2 bullet under this confusion. In general, if you end up in a steep minimum with a low learning rate, you will never leave (steep and low are both relative here).
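A sketch of the "landing exactly on the minimum" point, on $f(x) = x^2$ (a stand-in function; floating-point effects are ignored here for simplicity):

```python
def step(x, lr=0.1):
    """One GD update on f(x) = x**2."""
    return x - lr * 2 * x

# Landing exactly on the minimum: every subsequent update moves you exactly 0.
assert step(0.0) == 0.0

# From anywhere else you approach the minimum asymptotically, never reaching it.
x = 1.0
for _ in range(1000):
    x = step(x)
print(x)   # astronomically small, but still strictly positive
```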
Confusion 3 answer: I'm a bit confused by your wording here. Importantly, a "reduction in cost function" is not the same thing as a "deeper minimum", for the reason I mentioned above. A deeper minimum just means a bigger-magnitude derivative, which means bigger steps left or right. This could be highly detrimental.
Ultimately I think you shouldn't worry too much about this. Just think of SGD as a fast, low-resolution version of GD which happens to be slightly robust against extremely weak minima - this is more of an oddity than anything. If you are concerned with robustness then again look at (S)GD + Momentum.
Best Answer
I was able to reach Brandyn Webb (aka S. Funk) and he clarified some of my questions. The ideas that led to the SGD-based SVD can be found in this paper
http://courses.cs.washington.edu/courses/cse528/09sp/sanger_pca_nn.pdf
where Sanger, building on work of Oja
http://users.ics.aalto.fi/oja/Oja1982.pdf
talks about finding the top N eigenvectors, with probability 1, through a method called the Generalized Hebbian Algorithm (GHA) using a neural network. That is, training this particular network ends up doing PCA (very cool).
Webb was researching NNs at the time (way before Netflix). Later, Gorrell and Webb realized the formulation looked like it was solving an SVD problem (again, before the Netflix Prize). When Webb first started using the SGD derivation for Netflix, he mentioned that he "went for the direct gradient method on the spur of the moment because it was faster to derive that on a napkin than to find my old paper".
The question can be asked: "how much SVD is Funk SVD?". Some on the Net call it "Funk's heuristic approach to SVD" and say "in missing data form it's a factor model, not SVD". Another says "in the case of a partially observed matrix [..] the factorized solution UV' is not truly an SVD - there is no constraint that the matrices U and V be orthogonal".
Webb does not agree with the first two, nor with the last.
Overall, we can say there are connections between GHA, the so-called "iterative power method", the gradient solution, and SVD; it is because of this, perhaps, that Webb could delve into a solution without being scared off by the non-convexity of the problem.
No matter what route was taken during the invention of this method, however, we can look at the error function for recommendations and see through its gradients that SGD can minimize it, which is borne out by empirical tests. The error function is non-convex, but as Bottou and LeCun suggest, the absence of theory should not stop us from using a method. Plus, people have started to look much more closely at stochastic approaches and their theoretical guarantees, as seen here
http://arxiv.org/pdf/1308.3509
The paper above also talks about power method and SGD connection BTW.
Note: the power method finds a single eigenvalue/eigenvector pair, namely the eigenvector for the largest eigenvalue. However, it can be used to find all the other eigenvectors by "removing" the newly found eigenvector from the matrix A (through a process called deflation) and re-running the power method on the deflated A, which finds the second eigenvector, and so on.
http://www.maths.qmul.ac.uk/~wj/MTH5110/notes/MAS235_lecturenotes1.pdf
http://heim.ifi.uio.no/~tom/powerandqrslides.pdf
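The deflation recipe in the note above can be sketched in a few lines of NumPy (the test matrix and iteration count are made up for illustration; this is not production code):

```python
import numpy as np

rng = np.random.default_rng(3)

def power_method(A, iters=500):
    """Power iteration: dominant eigenvalue/eigenvector of a symmetric A."""
    v = rng.normal(size=A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v @ A @ v, v          # Rayleigh quotient, eigenvector

# Made-up symmetric test matrix with eigenvalues (7 ± sqrt(5)) / 2.
A = np.array([[4.0, 1.0], [1.0, 3.0]])

lam1, v1 = power_method(A)
# Deflation: remove the found component, then re-run for the next pair.
lam2, v2 = power_method(A - lam1 * np.outer(v1, v1))
print(lam1, lam2)
```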