The applicability of batch or stochastic gradient descent really depends on the error manifold you expect.
Batch gradient descent computes the gradient using the whole dataset. This is great for convex or relatively smooth error manifolds: in this case, we move fairly directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed (gradually decreasing) learning rate, will eventually find the minimum located in its basin of attraction.
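Here's a minimal sketch of batch gradient descent with an annealed learning rate on a toy least-squares problem (all names, sizes, and the `0.5/sqrt(t)` schedule are illustrative choices, not the only options):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noisy linear-regression problem: a smooth, convex error surface.
X = rng.normal(size=(500, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.5, size=500)

w = np.zeros(2)
for t in range(1, 2001):
    grad = X.T @ (X @ w - y) / len(y)  # exact gradient from the WHOLE dataset
    lr = 0.5 / np.sqrt(t)              # annealed learning rate
    w -= lr * grad

# With annealing, batch GD settles into the minimum of its basin,
# which here is the unique least-squares solution.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(w, w_star, atol=1e-3)
```

Because the gradient is exact, every step points straight down the surface; the shrinking learning rate just stops it from overshooting near the bottom.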
Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons explained a bit later. SGD works well (not well, I suppose, but better than batch gradient descent) for error manifolds that have lots of local maxima/minima. In this case, the somewhat noisier gradient calculated from the reduced number of samples tends to jerk the model out of local minima into a region that is hopefully more optimal. Single samples are really noisy, while minibatches average a little of the noise out, so the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch size is small enough to avoid some of the poor local minima, but large enough that it doesn't avoid the global minimum or better-performing local minima. (Incidentally, this assumes that the best minima have larger and deeper basins of attraction, and are therefore easier to fall into.)
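You can measure this noise-averaging effect directly. The sketch below (a toy linear-regression setup; the helper names `grad` and `grad_noise` are just illustrative) compares how far single-sample, small-minibatch, and large-minibatch gradients stray from the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data.
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)  # evaluate all gradients at the same fixed point

def grad(Xb, yb, w):
    # Gradient of the mean squared error on a batch (Xb, yb).
    return Xb.T @ (Xb @ w - yb) / len(yb)

def grad_noise(batch_size, trials=500):
    # Average distance of minibatch gradients from the full-batch gradient.
    full = grad(X, y, w)
    devs = [
        np.linalg.norm(grad(X[idx], y[idx], w) - full)
        for idx in (rng.choice(len(X), size=batch_size, replace=False)
                    for _ in range(trials))
    ]
    return np.mean(devs)

# Bigger minibatches average out more of the noise (the "jerk").
assert grad_noise(1) > grad_noise(32) > grad_noise(256)
```

The noise shrinks roughly like one over the square root of the minibatch size, which is why the minibatch size acts as a dial between "jerky" and "smooth" descent.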
One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient: each sample or batch of samples must be loaded, processed, its results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable.
Usually, this computational advantage is leveraged by performing many more iterations of SGD, taking many more steps than conventional batch gradient descent. This usually results in a model that is very close to the one that would be found via batch gradient descent, or better.
The way I like to think of how SGD works is to imagine that I have one point that represents my input distribution. My model is attempting to learn that input distribution. Surrounding the input distribution is a shaded area that represents the input distributions of all of the possible minibatches I could sample. It's usually a fair assumption that the minibatch input distributions lie close to the true input distribution. Batch gradient descent, at all steps, takes the steepest route to reach the true input distribution. SGD, on the other hand, chooses a random point within the shaded area and takes the steepest route towards that point; at each iteration, though, it chooses a new point. The average of all of these steps will approximate the true input distribution, usually quite well.
Gradient descent is an iterative method for solving the optimization problem. There is no concept of "epoch" or "batch" in classical gradient descent. The key ideas of gradient descent are:
- Update the weights by the gradient direction.
- The gradient is calculated precisely from all the data points.
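The two ideas above can be sketched in a few lines (a toy noiseless least-squares problem; the data and learning rate are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])  # noiseless, so the exact minimizer is known

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    # Gradient computed precisely from ALL data points, every iteration.
    grad = X.T @ (X @ w - y) / len(y)
    # Update the weights along the (negative) gradient direction.
    w -= lr * grad

assert np.allclose(w, [1.0, -2.0, 0.5], atol=1e-4)
```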
Stochastic Gradient Descent can be explained as a quick and dirty way to "approximate the gradient" from one single data point. If we relax this "one single data point" to "a subset of the data", then the concepts of batch and epoch arise.
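In code, the batch/epoch version of that relaxation looks something like this (again a toy noiseless regression; `batch = 32` and `lr = 0.05` are illustrative hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
lr, batch = 0.05, 32
for epoch in range(20):                      # one epoch = one pass over the data
    perm = rng.permutation(len(X))           # reshuffle each epoch
    for start in range(0, len(X), batch):
        idx = perm[start:start + batch]      # one minibatch
        # Approximate gradient from a subset of the data.
        g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * g

assert np.allclose(w, [1.0, -2.0, 0.5], atol=1e-2)
```

Setting `batch = 1` recovers the "one single data point" version; setting `batch = len(X)` recovers classical gradient descent.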
I have a related answer here (with code and plot for the demo)
How could stochastic gradient descent save time comparing to standard gradient descent?
Best Answer
Generally, if your data is ordered (see e.g. the MNIST data set), SGD will have problems. Also, if you run through it multiple times (so-called epochs), keeping the same order on each pass will probably lead to problems like getting stuck in local minima or slower convergence. So you should reshuffle on each epoch.
In the case of a huge number of data points, it is not recommended to shuffle the whole data set. Rather, you can follow another random strategy: it may be enough to randomly sample (without replacement) a mini-batch of data points, fit the model on that mini-batch by SGD, and repeat this process across more mini-batches until convergence. This procedure will also find the solution and will probably use far less data than the full set of data points (though the latter depends on the type of model and data). It is therefore most of the time the much cheaper procedure.
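A minimal sketch of that sampling strategy on a toy problem (the sizes, step count, and learning rate are illustrative; `n` stands in for a dataset too large to shuffle wholesale):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 3            # pretend n is too large to shuffle wholesale
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, 0.0, -1.0])

w = np.zeros(d)
lr, batch = 0.1, 256
for step in range(200):
    # Sample one mini-batch without replacement; never shuffle all n rows.
    idx = rng.choice(n, size=batch, replace=False)
    g = X[idx].T @ (X[idx] @ w - y[idx]) / batch
    w -= lr * g

# 200 steps touch at most 200 * 256 = 51,200 rows -- about half the data --
# yet the fit is already essentially exact on this easy problem.
assert np.allclose(w, [2.0, 0.0, -1.0], atol=1e-2)
```

How much of the data you actually need before convergence depends on the model and the data, as noted above; this easy convex problem is the best case.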