Solved – Stochastic Gradient Descent, Mini-Batch and Batch Gradient Descent

deep learning, gradient descent, neural networks, stochastic gradient descent

I was learning about the optimization part of deep learning.

Let's take linear regression as a simple example. Let $m$ be the total number of data points in the training set $(X, y)$ and $n$ be the number of iterations.
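(For concreteness, here is a minimal setup that the snippets below assume; the synthetic data and the names m, d, n, alpha, theta are just for illustration and are not part of the original question.)

import numpy as np

m, d = 1000, 5                       # number of samples, number of features
n = 100                              # number of iterations (epochs)
alpha = 0.01                         # learning rate

rng = np.random.default_rng(0)
X = rng.normal(size=(m, d))          # design matrix
true_theta = rng.normal(size=d)
y = X.dot(true_theta) + 0.1 * rng.normal(size=m)   # noisy linear targets

theta = np.zeros(d)                  # parameters to be learned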

For (Batch) Gradient Descent (GD), we do

for i in range(n):
    y_hat = np.dot(X, theta)                                   # predictions for all m samples
    theta = theta - alpha * (1.0/m) * np.dot(X.T, y_hat - y)   # one update per full pass over the data

Here we use vectorization (np.dot(X.T, y_hat - y)) to improve performance. Basically, this code says: we FIRST have to do something (i.e., np.dot(X.T, y_hat - y)) with ALL of our training samples, and only then can we update the parameters, moving one step in the parameter space towards the minimum.
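(Just to spell out what the vectorized expression replaces, here is a quick sanity check of my own: np.dot(X.T, y_hat - y) is simply the sum of the per-sample gradient contributions.)

# per-sample accumulation, equivalent to the vectorized np.dot(X.T, y_hat - y)
grad = np.zeros_like(theta)
for j in range(m):
    grad += (y_hat[j] - y[j]) * X[j, :]   # gradient contribution of sample j

assert np.allclose(grad, np.dot(X.T, y_hat - y))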

Now let's look at Stochastic Gradient Descent (SGD):

for i in range(n):
    # reshuffle the training set at the start of every pass (epoch)
    perm = np.random.permutation(m)
    X, y = X[perm], y[perm]
    for j in range(m):
        y_hat_j = np.dot(X[j, :], theta)                    # prediction for sample j only
        theta = theta - alpha * (y_hat_j - y[j]) * X[j, :]  # update immediately, one sample at a time

Now we can see that with stochastic gradient descent you can update your parameters much more often: you only need to compute the gradient at one data point of the training set, and then you can immediately update the parameters and take a step in the parameter space. However, that step is not guaranteed to move towards the minimum of the objective function (in contrast to GD).

However, I found that we still have to go through all the training data points, as you can see from the for j in range(m). So if $m$ is large, I don't actually see why SGD is faster than GD. To me it seems that GD spends a lot of time computing the average gradient over $m$ samples and then updates the parameters once, while SGD updates the parameters $m$ times; I thought GD would be much more efficient. One possible answer could be: for SGD, the total number of iterations $n_{SGD}$ needed to reach a local minimum may be smaller than $n_{GD}$. Another possible answer could be: with GD, maybe the matrix $(X, y)$ is just too big to fit into RAM, so we have to fetch the data from disk, which is very slow, whereas with SGD you can put one sample of the training set in RAM and do the computation.
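(To illustrate that second point, here is a rough sketch of SGD that streams one sample at a time from disk, so only a single row ever needs to be in RAM; the file data.csv and its row layout "x_1,...,x_d,y" are made up for illustration.)

theta = np.zeros(d)   # d features per sample, as in the setup above

for i in range(n):
    with open("data.csv") as f:                               # hypothetical file: one sample per line
        for line in f:
            row = np.array(line.strip().split(","), dtype=float)  # parse "x_1,...,x_d,y"
            x_j, y_j = row[:-1], row[-1]
            y_hat_j = np.dot(x_j, theta)
            theta = theta - alpha * (y_hat_j - y_j) * x_j     # update from this one row only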

Also, what is the word "stochastic" for? It seems that we need to loop over all the training data points anyway. If you are using mini-batches, then before that for j in range(m) people usually shuffle the training set. But if you are using SGD, I am not even sure whether shuffling the training set is necessary.

Originally, I thought that we don't have to loop over all the training data points at all, i.e., that we could drop the for j in range(m). Then, in each iteration for i in range(n), we would randomly sample one data point from the training set and perform the parameter update (I actually think this logic also works; I implemented this idea and ran a few tests, not too many, and the algorithm does converge quickly). In this sense I can see "stochastic", but it seems I was wrong, because in Andrew Ng's deep learning class I see that we still need for j in range(m) to go through all the training data points.
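(For reference, that variant, i.e. sampling one point uniformly at random per update instead of sweeping through the whole shuffled set, would look roughly like this, assuming the same X, y, theta, alpha, m, n as above.)

for i in range(n):
    j = np.random.randint(m)                            # pick one sample uniformly at random
    y_hat_j = np.dot(X[j, :], theta)
    theta = theta - alpha * (y_hat_j - y[j]) * X[j, :]  # single stochastic update per iteration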

Best Answer

The fact that SGD is usually faster/better than BGD can be understood from the following two aspects:

1) For large datasets, it is common that one portion of the dataset resembles another portion in terms of the patterns it encodes, so going over the whole set for every single update is a waste.

2) Stochastic gradient descent also enables you to jump from one valley to another, so the solution is not trapped around a local minimum determined by the initialization of your NN. In principle, you are therefore able to find better-trained parameters using SGD compared to BGD.

If I am correct, the state-of-the-art approach is mini-batch GD, which combines the benefits of both SGD and BGD.
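A minimal sketch of mini-batch GD, assuming the same X, y, theta, alpha, m, n as in the question and an illustrative batch size of 32:

batch_size = 32
for i in range(n):
    perm = np.random.permutation(m)                 # shuffle once per epoch
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]        # indices of the current mini-batch
        X_b, y_b = X[idx], y[idx]
        y_hat_b = np.dot(X_b, theta)
        # average gradient over the mini-batch, one update per batch
        theta = theta - alpha * (1.0 / len(idx)) * np.dot(X_b.T, y_hat_b - y_b)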

A good and more thorough discussion of the advantages and disadvantages of batch, stochastic, and mini-batch GD can be found in this paper: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf (starting from page 5).