Solved – Convergence Criteria for Stochastic Gradient Descent

gradient descent, machine learning, neural networks

I am familiar with the update rule in SGD, whereby the weights ($w$) are updated with the gradient of the cost function ($\nabla Q_i$) for each sample $i$, scaled by the learning rate ($\eta$):

$$ w := w - \eta \nabla Q_{i}(w) $$

Now my question is when to evaluate the stopping criterion. In the neural network algorithms I am looking at, steps/epochs continue until the gradients fall below a pre-defined threshold (or a maximum number of epochs is reached). Are the gradients evaluated within the sample loop (i.e. within SGD) or after each full pass (i.e. each epoch)?

In the former case, I can imagine a situation where the first random samples drawn from the dataset all belong to the same class, so SGD converges rapidly by minimizing that error. This would result in a model that generalizes very poorly because it has 'learned' to classify everything as the same class. So I suspect this is not the case.

If the latter, how would the gradients be evaluated for the stopping condition? Would the error be averaged over the samples? I haven't been able to find anything online beyond how the update rule itself works; nothing on implementing it within a neural network algorithm. I assume the approach would be the same for 'mini-batch' gradient descent.

Some pseudo-code to help explain my confusion:

# gradient.norm starts above threshold so the loop runs at least once
while(epoch < max_epochs && gradient.norm > threshold){

    # SGD: one pass over the N training samples
    for(i in 1:N){
        weights = weights - eta * dQ_i(weights)

        # calculate the magnitude of the new gradient
        gradient.norm = ...

        # stop check here???
        if(gradient.norm < threshold) break
    }

    # stop check here???
    # if so, how to evaluate???
    ...

    epoch = epoch + 1
}

Best Answer

I would suggest having some held-out data that forms a validation dataset. You can compute your loss function on the validation dataset periodically (it would probably be too expensive after each iteration, so after each epoch seems to make sense) and stop training once the validation loss has stabilized.
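As a concrete sketch of that idea (assuming a linear model with squared loss and made-up parameter values; the same structure carries over to a neural network), the epoch loop could look something like this:

import numpy as np

def sgd_with_early_stopping(X_train, y_train, X_val, y_val,
                            eta=0.01, max_epochs=100, patience=5, tol=1e-4):
    """SGD on a linear model with squared loss; training stops once the
    held-out validation loss has failed to improve by tol for
    patience consecutive epochs."""
    rng = np.random.default_rng(0)
    w = np.zeros(X_train.shape[1])
    best_val_loss = np.inf
    stale_epochs = 0

    for epoch in range(max_epochs):
        # one SGD pass over the training samples in random order
        for i in rng.permutation(len(X_train)):
            grad = (X_train[i] @ w - y_train[i]) * X_train[i]   # dQ_i(w)
            w -= eta * grad

        # evaluate the held-out loss once per epoch, not per sample
        val_loss = np.mean((X_val @ w - y_val) ** 2)

        if val_loss < best_val_loss - tol:
            best_val_loss, stale_epochs = val_loss, 0
        else:
            stale_epochs += 1
        if stale_epochs >= patience:
            break    # validation loss has stabilized

    return w

Note that the stopping decision sits outside the inner per-sample loop: the check uses a loss aggregated over many samples rather than a single noisy per-sample gradient.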

If you're in a purely online setting where you don't have any data ahead of time, I suppose you could compute the average loss over the examples in each epoch and wait for that average loss to converge, but of course that could lead to overfitting...
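A minimal sketch of that online variant (same linear-model/squared-loss assumptions as above, with a hypothetical stream() that yields the epoch's (x, y) pairs):

import numpy as np

def online_sgd_until_loss_converges(stream, dim, eta=0.01, tol=1e-4, max_epochs=100):
    """No hold-out data: track the average training loss of each epoch and
    stop once consecutive epoch averages stop changing (which can overfit)."""
    w = np.zeros(dim)
    prev_avg = np.inf

    for epoch in range(max_epochs):
        losses = []
        for x, y in stream():                      # hypothetical example source
            losses.append(0.5 * (x @ w - y) ** 2)  # loss before the update
            w -= eta * (x @ w - y) * x             # SGD step on the same sample
        avg = np.mean(losses)
        if abs(prev_avg - avg) < tol:
            break                                  # average epoch loss has converged
        prev_avg = avg

    return w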

It looks like Vowpal Wabbit (an online learning system that implements SGD among other optimizers) uses a technique called progressive cross-validation, which is similar to using a hold-out set but allows you to use more of the data for training the model; see:

http://hunch.net/~jl/projects/prediction_bounds/progressive_validation/coltfinal.pdf
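The core idea of progressive validation is to score each example with the current model before training on it, so every example contributes once to an 'unseen' loss estimate and is still used for learning afterwards. A rough sketch (same linear-model assumptions as above, not VW's actual implementation):

import numpy as np

def progressive_validation_loss(examples, dim, eta=0.01):
    """Average of the per-example losses, where each loss is computed
    before the model is updated on that example."""
    w = np.zeros(dim)
    total_loss = 0.0
    n = 0

    for x, y in examples:
        pred = x @ w
        total_loss += 0.5 * (pred - y) ** 2   # loss on a not-yet-seen example
        w -= eta * (pred - y) * x             # then take the SGD step on it
        n += 1

    return total_loss / max(n, 1)             # average progressive loss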

Vowpal Wabbit has an interesting approach: it computes error metrics after each example but prints the diagnostics with an exponential backoff, so at first you get frequent updates (to help diagnose early problems) and then less frequent updates as time goes on.

Vowpal Wabbit displays two error metrics: the average progressive loss overall, and the average progressive loss since the last time the diagnostics were printed. You can read some details about the VW diagnostics below:

https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial#vws-diagnostic-information
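For illustration, here is a rough sketch of that reporting scheme (not VW's actual code; names and output formatting are made up): per-example losses are accumulated, and a diagnostic line is printed whenever the example count reaches a power of two.

def report_progressive_loss(example_losses):
    """Print diagnostics with exponential backoff, showing the overall
    average progressive loss and the average since the previous report."""
    total = since_last = 0.0
    count = count_since_last = 0
    next_report = 1

    for loss in example_losses:
        total += loss
        since_last += loss
        count += 1
        count_since_last += 1

        if count == next_report:
            print(f"examples={count:8d}  "
                  f"average loss={total / count:.6f}  "
                  f"since last={since_last / count_since_last:.6f}")
            since_last, count_since_last = 0.0, 0
            next_report *= 2    # reports get rarer as training goes on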