Solved – Batch gradient descent versus stochastic gradient descent

gradient descent, optimization, stochastic gradient descent

Suppose we have a training set $(x_{(i)}, y_{(i)})$ for $i = 1, \dots, m$, and suppose we run some type of supervised learning algorithm on it. Hypotheses are represented as $h_{\theta}(x_{(i)}) = \theta_0+\theta_{1}x_{(i)1} + \cdots +\theta_{n}x_{(i)n}$. We need to find the parameters $\theta$ that minimize the "distance" between $y_{(i)}$ and $h_{\theta}(x_{(i)})$. Let $$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y_{(i)}-h_{\theta}(x_{(i)})\right)^{2}$$

Then we want to find the $\theta$ that minimizes $J(\theta)$. In gradient descent we initialize the parameters and then repeatedly perform the following update for every $j$: $$\theta_j := \theta_j-\alpha \frac{\partial}{\partial \theta_{j}} J(\theta)$$
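For reference, working out the derivative of the cost above (taking $x_{(i)0} = 1$ for the intercept term) gives $$\frac{\partial}{\partial \theta_j} J(\theta) = -\sum_{i=1}^{m} \left(y_{(i)} - h_{\theta}(x_{(i)})\right) x_{(i)j}$$ so the gradient is a sum over all $m$ training examples.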

What is the key difference between batch gradient descent and stochastic gradient descent?

Both use the above update rule. But is one better than the other?

Best Answer

Whether batch or stochastic gradient descent is the better fit really depends on the error manifold you expect.

Batch gradient descent computes the gradient using the whole dataset. This is great for convex or relatively smooth error manifolds: we move fairly directly towards an optimum, either local or global. Additionally, batch gradient descent with an annealed learning rate will eventually find the minimum located in its basin of attraction.
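To make that concrete, here is a minimal NumPy sketch of batch gradient descent for the squared-error cost in the question (the function name, learning rate, and iteration count are placeholders of my own choosing, not something from the answer):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, n_iters=1000):
    """Minimize J(theta) = 1/2 * sum((y - X @ theta)**2) with full-batch updates.

    X is assumed to already contain a leading column of ones for the intercept.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        residual = y - X @ theta      # shape (m,): errors on ALL m examples
        grad = -X.T @ residual / m    # average gradient over the whole dataset
                                      # (dividing by m just rescales the learning rate)
        theta -= alpha * grad         # one parameter update per full pass
    return theta
```

Every update touches all $m$ rows of `X`, which is exactly why the trajectory is smooth but each step pays the full per-pass cost.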

Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons explained a bit later. SGD works well ("well" is relative here; it works better than batch gradient descent) for error manifolds with lots of local maxima/minima. In this case, the somewhat noisier gradient computed from the reduced number of samples tends to jerk the model out of local minima into a region that is hopefully closer to optimal. Single samples are really noisy, while minibatches average a little of the noise out, so the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch is small enough to avoid some of the poor local minima, but large enough that it doesn't avoid the global minimum or better-performing local minima. (Incidentally, this assumes that the best minima have larger and deeper basins of attraction and are therefore easier to fall into.)
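As a rough sketch of the minibatch variant described above (again with hypothetical names and hyperparameters; reshuffling once per epoch is one common choice, not the only one):

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.01, n_epochs=20, batch_size=32, seed=0):
    """Same cost as before, but each update uses only a small random minibatch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        order = rng.permutation(m)                  # reshuffle the examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - X[idx] @ theta
            grad = -X[idx].T @ residual / len(idx)  # noisy estimate of the full gradient
            theta -= alpha * grad                   # many cheap, noisy steps per pass
    return theta
```

Setting `batch_size=1` gives the single-sample SGD the answer starts from; raising it averages more of the noise out, which is the trade-off described in the paragraph above.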

One benefit of SGD is that it is computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization over the full dataset much less efficient: each sample or batch of samples must be loaded, worked with, the results stored, and so on. A minibatch, on the other hand, is usually deliberately sized to be small enough to remain computationally tractable.
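For illustration, one way that load-work-store pattern might look with NumPy memory-mapped `.npy` files (the paths and function name are hypothetical; the point is only that a single minibatch is pulled into RAM at a time):

```python
import numpy as np

def minibatches_from_disk(x_path, y_path, batch_size=256):
    """Yield (X_batch, y_batch) pairs without loading the whole dataset into RAM."""
    # mmap_mode="r" maps the .npy files lazily; only the rows we slice get read
    X = np.load(x_path, mmap_mode="r")
    y = np.load(y_path, mmap_mode="r")
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        # np.array copies just this slice into ordinary memory for the update step
        yield np.array(X[start:stop]), np.array(y[start:stop])
```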

Usually, this computational advantage is leveraged by performing many more iterations of SGD, taking many more steps than conventional batch gradient descent. The result is usually a model that is very close to the one batch gradient descent would find, or better.

The way I like to think of how SGD works is to imagine that I have one point representing my input distribution. My model is attempting to learn that input distribution. Surrounding it is a shaded area representing the input distributions of all of the possible minibatches I could sample. It's usually a fair assumption that the minibatch input distributions are close to the true input distribution. Batch gradient descent, at every step, takes the steepest route to reach the true input distribution. SGD, on the other hand, picks a random point within the shaded area and takes the steepest route towards that point, choosing a new point at each iteration. The average of all of these steps approximates the true input distribution, usually quite well.