Gradient Descent – Differences Between Gradient Descent and Batch Gradient Descent Explained

Tags: gradient, gradient descent, optimization

It seems that batch gradient descent is just traditional gradient descent, except that the objective function is in the form of a summation?

Best Answer

Gradient descent uses all of your data at each iteration to compute the gradient of your log-likelihood, i.e. at each step it works with the actual function that is to be optimized, the log-likelihood. This is the most standard optimization procedure for problems with a continuous domain and range. There is nothing stochastic (random) about it.
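As a minimal sketch of the full-data case, the snippet below runs plain gradient descent on a toy one-parameter problem. Everything in it is an assumption made for illustration: the data, the names `full_gradient` and `gradient_descent`, the learning rate, and the use of mean squared error as a stand-in for a negative log-likelihood (the two agree up to constants under a Gaussian noise model).

```python
# Hypothetical toy data: y = 3*x exactly, so the least-squares optimum is w = 3.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

def full_gradient(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error, computed over ALL data."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def gradient_descent(w0, xs, ys, lr=0.05, steps=200):
    """Plain gradient descent: the same objective is used at every step."""
    w = w0
    for _ in range(steps):
        w -= lr * full_gradient(w, xs, ys)
    return w

w_hat = gradient_descent(0.0, xs, ys)
```

Because every step uses the whole dataset, running it twice from the same starting point gives exactly the same result.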

Batch gradient descent, in the sense used here (many texts call this mini-batch gradient descent and reserve "batch" for the full-data version above), doesn't take all of your data, but rather at each step only some new randomly chosen subset (the "batch") of it. Thus, at each step, the gradient is taken of a function that differs from the actual objective function (the log-likelihood in our case). Different batches result in different functions and thus different gradients at the same parameter vector.
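The point that different batches give different gradients at the same parameter value can be checked on a small example. This is only a sketch: the data, the helper `mse_gradient`, and the choice of squared error in place of a log-likelihood are all assumptions for illustration.

```python
import random

# Hypothetical toy data: y = 3*x for x = 0..9.
data = [(float(x), 3.0 * x) for x in range(10)]

def mse_gradient(w, points):
    """Gradient w.r.t. w of the mean squared error over the given points."""
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)

rng = random.Random(0)          # seeded so the split is reproducible
shuffled = data[:]
rng.shuffle(shuffled)
batch_a, batch_b = shuffled[:5], shuffled[5:]   # two random halves

w = 0.0                          # the same parameter value for all three
grad_full = mse_gradient(w, data)
grad_a = mse_gradient(w, batch_a)
grad_b = mse_gradient(w, batch_b)
```

Here `grad_a` and `grad_b` differ from each other and from `grad_full`, yet their average equals `grad_full`, because the two batches are equal-sized halves of the data.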

Now, most of the time, those batches are chosen via some kind of random procedure, which makes the gradients computed at each step random, i.e. stochastic. That's why it is called stochastic gradient descent (SGD).
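A rough sketch of such a loop is shown below. The function name `sgd`, the batch size, the learning rate, and the toy data are assumptions, not something prescribed by the answer. The data is noise-free, so every batch agrees on the optimum and the iterates settle near w = 3 even though each individual step uses a random gradient.

```python
import random

# Hypothetical toy data: y = 3*x for x = 0..9.
data = [(float(x), 3.0 * x) for x in range(10)]

def mse_gradient(w, points):
    """Gradient w.r.t. w of the mean squared error over the given points."""
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)

def sgd(w0, data, batch_size=3, lr=0.005, steps=500, seed=0):
    """SGD sketch: each step uses a freshly drawn random batch."""
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)   # random subset, no replacement
        w -= lr * mse_gradient(w, batch)       # gradient of the batch objective
    return w

w_hat = sgd(0.0, data)
```

Changing `seed` changes the path the iterates take, which is exactly the randomness the name refers to.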

Doing "batch gradient descent" without any randomness in the choice of the batches is not recommended; it will usually lead to bad results.
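One way this can go wrong, sketched under assumed conditions: suppose the data happens to be sorted so that the first half follows one relationship and the second half another. A single in-order pass then ends up fitting mostly the last batches it saw. The data, the names `ordered_pass` and `best_fit`, and the step size are hypothetical.

```python
# Hypothetical sorted dataset: two groups with different slopes, never shuffled.
xs = [0.5, 1.0, 1.5, 2.0] * 5
group_a = [(x, 2.0 * x) for x in xs]   # first half: slope 2
group_b = [(x, 4.0 * x) for x in xs]   # second half: slope 4
data = group_a + group_b

def mse_gradient(w, points):
    """Gradient w.r.t. w of the mean squared error over the given points."""
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)

def best_fit(points):
    """Closed-form least-squares slope for the model y = w*x."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

def ordered_pass(w0, data, batch_size=4, lr=0.1):
    """One pass over the data in its stored order, with no randomness."""
    w = w0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        w -= lr * mse_gradient(w, batch)
    return w

w_opt = best_fit(data)                 # optimum of the full objective
w_ordered = ordered_pass(0.0, data)    # result of the deterministic pass
```

In this setup the full-data optimum `w_opt` is 3, while `w_ordered` ends much closer to 4, the slope of the group that was processed last.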

Some people use "batch gradient descent" to refer to online learning, where each new batch from a data stream is used only once and then thrown away. But this can also be understood as SGD, provided the data stream does not contain some weird regularity.
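The online setting can be sketched with a generator that plays the role of the data stream. The generator `data_stream`, its noise level, and the function `online_sgd` are assumptions made for illustration; a real stream would come from wherever the data actually arrives.

```python
import random

def data_stream(n_batches, batch_size=8, seed=0):
    """Hypothetical stream: yields fresh batches of (x, y) with y = 3*x + noise."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = rng.uniform(0.0, 2.0)
            batch.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))
        yield batch

def mse_gradient(w, points):
    """Gradient w.r.t. w of the mean squared error over the given points."""
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)

def online_sgd(w0, stream, lr=0.1):
    """Consume each batch exactly once; nothing is kept for a second pass."""
    w = w0
    for batch in stream:
        w -= lr * mse_gradient(w, batch)
    return w

w_hat = online_sgd(0.0, data_stream(n_batches=200))
```

Since the batches are independent random draws, each update is a stochastic gradient step on the underlying objective, which is why this fits the SGD picture. With a constant step size and noisy data, `w_hat` ends up near 3 but keeps fluctuating slightly around it.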