Solved – Is stochastic gradient descent a complete replacement for gradient descent?

generalized linear model, gradient descent, stochastic gradient descent

I know the differences between stochastic gradient descent and gradient descent, as well as the advantages of stochastic gradient descent over gradient descent, such as its computational performance and its reduced tendency to get stuck in local minima.

However, can we say that SGD is a complete replacement for gradient descent, or are there scenarios where SGD doesn't work well?

Best Answer

As with any algorithm, choosing one over the other comes with some pros and cons.

  • Gradient descent (GD) generally requires the entire dataset to be loaded in memory, since it computes the gradient over all samples at once, whereas SGD looks at one sample at a time (see the sketch after this list).
  • As a result, SGD is the better choice when memory is limited or when the data are streaming in.
  • Since GD looks at the data as a whole, it doesn't suffer as much from variance in the gradient estimate as SGD does. Combating this variance in SGD (which affects the rate of convergence) is an active area of research, and there are quite a few tricks one can try, such as mini-batching or averaging the iterates.
  • GD can exploit vectorization for faster gradient computations, while SGD's one-sample-at-a-time loop can be a bottleneck. Even so, SGD is still preferred over GD for large-scale learning problems, because it can reach a specified error threshold faster.
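
To make the per-update difference concrete, here is a minimal sketch contrasting full-batch GD with SGD on a least-squares linear regression problem. The toy data and the function names (`full_batch_gd`, `sgd`) are illustrative choices of mine, not anything taken from the paper.

```python
import numpy as np

# Toy least-squares problem (illustrative setup, not from the paper).
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

def full_batch_gd(X, y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient over every sample at once: vectorized, but needs all data in memory.
        grad = 2.0 / len(y) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

def sgd(X, y, lr=0.01, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            # Gradient from a single sample: cheap and streamable, but noisy.
            grad = 2.0 * (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

print("GD  error:", np.linalg.norm(full_batch_gd(X, y) - true_w))
print("SGD error:", np.linalg.norm(sgd(X, y) - true_w))
```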

Take a look at the paper Stochastic Gradient Descent Tricks by Leon Bottou for more comparisons and tricks for improving SGD performance.

To answer the last part of your question: the paper notes that SGD doesn't work well when the Hessian of the loss is ill-conditioned, i.e., when the curvature varies greatly across directions, so that no single step size suits all of them.
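
As a rough illustration of what ill-conditioning means here (an assumed toy setup of mine, not an example from the paper): when the features of a least-squares problem have very different scales, the Hessian (2/n) * X^T X has a large condition number, and a single SGD step size small enough to stay stable along the high-curvature direction makes very slow progress along the low-curvature one.

```python
import numpy as np

# Toy illustration (assumed setup, not from the paper): badly scaled features
# make the least-squares Hessian (2/n) * X^T X ill-conditioned, and plain SGD
# with one global step size then crawls along the low-curvature direction.

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([10.0 * rng.normal(size=n),   # high-curvature direction
                     0.1 * rng.normal(size=n)])   # low-curvature direction
true_w = np.array([0.5, 3.0])
y = X @ true_w + 0.01 * rng.normal(size=n)

hessian = 2.0 / n * X.T @ X
print("condition number:", np.linalg.cond(hessian))  # roughly (10 / 0.1)^2 = 1e4

w = np.zeros(2)
lr = 1e-4  # kept small so the high-curvature coordinate stays stable
for epoch in range(20):
    for i in rng.permutation(n):
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]  # single-sample gradient
        w -= lr * grad

print("SGD estimate:", w)       # first coordinate ends up close to 0.5,
print("true weights:", true_w)  # the second has barely moved toward 3.0
```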