Solved – the difference in the “weight update process” in gradient descent vs. stochastic gradient descent

backpropagation, deep learning, derivative, gradient descent, neural networks

Question

  1. In normal GD the weights are updated for every row in the training
     dataset, while in SGD the weights are updated only once per mini-batch,
     based on the cumulative dLoss/dw1, dLoss/dw2. Is my understanding
     correct?
  2. Does updating the weights only once for m examples, based on the
     cumulative dLoss/dw1, dLoss/dw2, give the same result as updating the
     weights for every row in the training data, just with SGD being faster?
     (The sketch after this list contrasts the two update schedules.)
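For reference, here is a minimal sketch of the two schedules described above: updating after every single row versus one update per mini-batch using the batch-averaged gradient. This is not from the course; the toy data, learning rate, and the two-weight squared-error model are invented purely for illustration.

```python
import numpy as np

# Toy data: two features, one per weight (w1, w2); invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                       # 8 rows, 2 features
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=8)

def grad(w, X_part, y_part):
    """Average gradient of the squared-error loss 0.5*mean((Xw - y)^2) w.r.t. w."""
    residual = X_part @ w - y_part
    return X_part.T @ residual / len(y_part)      # [dLoss/dw1, dLoss/dw2]

lr = 0.1

# Schedule A (point 1): update the weights after every single row.
w_per_row = np.zeros(2)
for i in range(len(y)):
    w_per_row -= lr * grad(w_per_row, X[i:i+1], y[i:i+1])

# Schedule B: one update per mini-batch, using the batch-averaged gradient.
w_minibatch = np.zeros(2)
batch_size = 4
for start in range(0, len(y), batch_size):
    Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    w_minibatch -= lr * grad(w_minibatch, Xb, yb)

print("per-row updates:   ", w_per_row)
print("mini-batch updates:", w_minibatch)
```

The two runs generally end at nearby but not identical weights, which is exactly what question 2 asks about.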

My question is with reference to the backpropagation algorithm below, from the Coursera Deep Learning course by Andrew Ng.

Best Answer

In general, you calculate the gradient of some error (which will most reasonably be some kind of average of per-row errors) with respect to the weights.

  1. Yes, but the error is not accumulated, only averaged. When using the full dataset you compute the average over the whole dataset; in mini-batch SGD you do this over the mini-batch, so you can treat that error and its gradient as estimators of the 'real' error and gradient.
  2. No, but if your estimate of the error is good enough you might not care, since using mini-batches can be much faster. For a concrete example see this notebook (the mini-batch SGD plot oscillates, since the samples might not be representative of the whole distribution); the sketch after this list shows the same estimator idea in a few lines.
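A minimal sketch of that estimator point, again with invented toy data: the gradient averaged over a random mini-batch scatters around the gradient averaged over the whole dataset, which is why the mini-batch loss curve oscillates while still pointing in roughly the right direction.

```python
import numpy as np

# Larger toy dataset, same invented two-weight linear model as above.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -3.0]) + 0.5 * rng.normal(size=1000)
w = np.zeros(2)

def grad(w, X_part, y_part):
    residual = X_part @ w - y_part
    return X_part.T @ residual / len(y_part)   # averaged, not summed

full_grad = grad(w, X, y)                      # 'real' gradient over the whole dataset

# Gradients from a few random mini-batches scatter around the full gradient.
for _ in range(5):
    idx = rng.choice(len(y), size=32, replace=False)
    print("mini-batch gradient:", grad(w, X[idx], y[idx]))
print("full-dataset gradient:", full_grad)
```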