Solved – Stochastic Gradient Descent, Mini-Batch and Batch Gradient Descent

deep learning, gradient descent, neural networks, stochastic gradient descent

I was learning about the optimization part of deep learning.

Let's take linear regression as a simple example. Let $m$ be the total number of data points in the training set $(X, y)$ and $n$ be the number of iterations.
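(For concreteness, here is a minimal setup that the snippets below assume; the synthetic data and the names m, d, n, alpha, theta are just for illustration and are not part of the original question.)

import numpy as np

m, d = 1000, 5                       # number of samples, number of features
n = 100                              # number of iterations (epochs)
alpha = 0.01                         # learning rate

rng = np.random.default_rng(0)
X = rng.normal(size=(m, d))          # design matrix
true_theta = rng.normal(size=d)
y = X.dot(true_theta) + 0.1 * rng.normal(size=m)   # noisy linear targets

theta = np.zeros(d)                  # parameters to be learned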

For (Batch) Gradient Descent (GD), we do

for i in range(n):
    y_hat = np.dot(X, theta)                                   # predictions for all m samples
    theta = theta - alpha * (1.0/m) * np.dot(X.T, y_hat - y)   # one update per full pass over the data

Here we use vectorization (np.dot(X.T, y_hat - y)) to improve performance. Basically, this code says: we FIRST have to do something (i.e., np.dot(X.T, y_hat - y)) with ALL of our training samples, and only then can we update the parameters, moving one step in the parameter space towards the minimum.
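(Just to spell out what the vectorized expression replaces, here is a quick sanity check of my own: np.dot(X.T, y_hat - y) is simply the sum of the per-sample gradient contributions.)

# per-sample accumulation, equivalent to the vectorized np.dot(X.T, y_hat - y)
grad = np.zeros_like(theta)
for j in range(m):
    grad += (y_hat[j] - y[j]) * X[j, :]   # gradient contribution of sample j

assert np.allclose(grad, np.dot(X.T, y_hat - y))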

Now let's look at Stochastic Gradient Descent (SGD):

for i in range(n):
    # reshuffle the training set at the start of every pass (epoch)
    perm = np.random.permutation(m)
    X, y = X[perm], y[perm]
    for j in range(m):
        y_hat_j = np.dot(X[j, :], theta)                    # prediction for sample j only
        theta = theta - alpha * (y_hat_j - y[j]) * X[j, :]  # update immediately, one sample at a time

Now we can see that with stochastic gradient descent you can update your parameters much more often: you only need to compute the gradient at one data point of the training set, and then you can immediately update the parameters and take a step in the parameter space. However, that step is not guaranteed to move towards the minimum of the objective function (in contrast to GD).

However, I found that we still have to go through all the training data points, as you can see from the for j in range(m). So if $m$ is large, I don't actually see why SGD is faster than GD. To me it seems that GD spends a lot of time computing the average gradient over $m$ samples and then updates the parameters once, while SGD updates the parameters $m$ times; I thought GD would be much more efficient. One possible answer could be: for SGD, the total number of iterations $n_{SGD}$ needed to reach a local minimum may be smaller than $n_{GD}$. Another possible answer could be: with GD, maybe the matrix $(X, y)$ is just too big to fit into RAM, so we have to fetch the data from disk, which is very slow, whereas with SGD you can put one sample of the training set in RAM and do the computation.
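(To illustrate that second point, here is a rough sketch of SGD that streams one sample at a time from disk, so only a single row ever needs to be in RAM; the file data.csv and its row layout "x_1,...,x_d,y" are made up for illustration.)

theta = np.zeros(d)   # d features per sample, as in the setup above

for i in range(n):
    with open("data.csv") as f:                               # hypothetical file: one sample per line
        for line in f:
            row = np.array(line.strip().split(","), dtype=float)  # parse "x_1,...,x_d,y"
            x_j, y_j = row[:-1], row[-1]
            y_hat_j = np.dot(x_j, theta)
            theta = theta - alpha * (y_hat_j - y_j) * x_j     # update from this one row only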

Also, what is the word "stochastic" for? It seems that we need to loop over all the training data points anyway. If you are using mini-batches, then before that for j in range(m) people usually shuffle the training set. But if you are using SGD, I am not even sure whether shuffling the training set is necessary.

Originally, I thought that we don't have to loop over all the training data points at all, i.e., that we could drop the for j in range(m). Then, in each iteration for i in range(n), we would randomly sample one data point from the training set and perform the parameter update (I actually think this logic also works; I implemented this idea and ran a few tests, not too many, and the algorithm does converge quickly). In this sense I can see "stochastic", but it seems I was wrong, because in Andrew Ng's deep learning class I see that we still need for j in range(m) to go through all the training data points.
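(For reference, that variant, i.e. sampling one point uniformly at random per update instead of sweeping through the whole shuffled set, would look roughly like this, assuming the same X, y, theta, alpha, m, n as above.)

for i in range(n):
    j = np.random.randint(m)                            # pick one sample uniformly at random
    y_hat_j = np.dot(X[j, :], theta)
    theta = theta - alpha * (y_hat_j - y[j]) * X[j, :]  # single stochastic update per iteration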

Best Answer

The fact that SGD is usually faster/better than BGD can be understood from the following two aspects:

1) For large datasets, it is common that one portion of the dataset resembles another portion in terms of the patterns it encodes, so going over the whole set for every single update is a waste.

2) Stochastic gradient descent also enables you to jump from one valley to another, so the solution is not trapped around a local minimum determined by the initialization of your NN. In principle, you are therefore able to find better-trained parameters using SGD compared to BGD.

If I am correct, the state-of-the-art approach is mini-batch GD, which combines the benefits of both SGD and BGD.
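A minimal sketch of mini-batch GD, assuming the same X, y, theta, alpha, m, n as in the question and an illustrative batch size of 32:

batch_size = 32
for i in range(n):
    perm = np.random.permutation(m)                 # shuffle once per epoch
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]        # indices of the current mini-batch
        X_b, y_b = X[idx], y[idx]
        y_hat_b = np.dot(X_b, theta)
        # average gradient over the mini-batch, one update per batch
        theta = theta - alpha * (1.0 / len(idx)) * np.dot(X_b.T, y_hat_b - y_b)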

A good and more thorough discussion of the advantages and disadvantages of batch, stochastic, and mini-batch GD can be found in this paper: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf (starting from page 5).