Mini-batch gradient descent is implemented basically as you describe in option 2.
- Epoch starts. We sample a mini-batch, feed it forward, compute the error, and backpropagate it, i.e. update the weights. We repeat this until we have sampled the full data set. Epoch over.
Assuming that the network is minimizing the following objective function:
$$
\frac{\lambda}{2}||\theta||^2 + \frac{1}{n}\sum_{i=1}^n E(x^{(i)}, y^{(i)}, \theta)
$$
the mini-batch weight update step is essentially
$$
\theta = (1 - \alpha \lambda) \theta - \alpha \frac{1}{b}\sum_{k=i}^{i+b-1} \frac{\partial E}{\partial \theta}(x^{(k)}, y^{(k)}, \theta)
$$
where:

- $E$ = the error measure (also sometimes denoted as a cost function $J$)
- $\theta$ = the weights
- $\alpha$ = the learning rate
- $1 - \alpha \lambda$ = the weight decay factor
- $b$ = the batch size
- $x^{(i)}, y^{(i)}$ = the $i$-th training example and its target
You loop over the consecutive batches (i.e. increment $i$ by $b$) and update the weights. This more frequent weight updating, combined with vectorization, is why mini-batch gradient descent tends to converge more quickly than either full-batch or stochastic methods.
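As a concrete illustration, here is a minimal NumPy sketch of the update equation above, applied to a hypothetical linear regression problem with squared error (the data, model, and hyperparameter values are made up for the example; only the update rule itself follows the equation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a linear regression problem with Gaussian noise.
n, d = 1000, 5
X = rng.normal(size=(n, d))
true_theta = rng.normal(size=d)
y = X @ true_theta + 0.1 * rng.normal(size=n)

theta = np.zeros(d)  # weights
alpha = 0.1          # learning rate
lam = 1e-3           # L2 regularization strength
b = 32               # batch size

for epoch in range(20):
    perm = rng.permutation(n)     # sample without replacement each epoch
    for i in range(0, n, b):      # loop over consecutive batches
        idx = perm[i:i + b]
        Xb, yb = X[idx], y[idx]
        # Gradient of the squared error, averaged over the batch.
        grad = Xb.T @ (Xb @ theta - yb) / len(idx)
        # Weight decay plus gradient step, matching the update equation:
        # theta = (1 - alpha * lambda) * theta - alpha * (1/b) * sum of gradients
        theta = (1 - alpha * lam) * theta - alpha * grad
```

The inner loop is the "increment by $b$" described above; each pass through the outer loop is one epoch.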
I would suggest holding out some data to form a validation set. You can compute your loss function on the validation set periodically (after every iteration would probably be too expensive, so after each epoch seems to make sense) and stop training once the validation loss has stabilized.
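A simple version of that stopping check might look like the sketch below; the loss values here are a made-up sequence standing in for the per-epoch validation losses you would actually compute, and `patience` (how many epochs without improvement to tolerate) is a common but hypothetical choice of criterion:

```python
def early_stopping_epochs(val_losses, patience=3, min_delta=1e-4):
    """Return how many epochs to train before the validation loss stabilizes."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # meaningful improvement
            best = loss
            bad_epochs = 0
        else:                          # no improvement this epoch
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch           # stop: loss has stabilized
    return len(val_losses)

# Hypothetical per-epoch validation losses: improving, then flat.
losses = [1.0, 0.6, 0.4, 0.35, 0.349, 0.351, 0.350, 0.352]
print(early_stopping_epochs(losses))  # → 8
```

The `min_delta` threshold keeps tiny random fluctuations from resetting the patience counter.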
If you're in a purely online setting where you don't have any data ahead of time, I suppose you could compute the average loss over the examples in each epoch and wait for that average to converge, but of course that could lead to overfitting...
It looks like Vowpal Wabbit (an online learning system that implements SGD amongst other optimizers) uses a technique called Progressive Cross-Validation which is similar to using a holdout set, but allows you to use more data while training the model, see:
http://hunch.net/~jl/projects/prediction_bounds/progressive_validation/coltfinal.pdf
Vowpal Wabbit has an interesting approach: it computes error metrics after each example but prints the diagnostics with an exponential backoff, so at first you get frequent updates (to help diagnose early problems), and then less frequent updates as training goes on.
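That doubling schedule is easy to reproduce. The sketch below is in the spirit of VW's diagnostics, not its actual implementation: it reports the running-average loss at examples 1, 2, 4, 8, and so on, over a made-up stream of per-example losses:

```python
def report_points(losses):
    """Yield (example_count, average_loss_so_far) at counts 1, 2, 4, 8, ..."""
    total = 0.0
    next_report = 1
    for count, loss in enumerate(losses, start=1):
        total += loss
        if count == next_report:       # exponential backoff: report, then double
            yield count, total / count
            next_report *= 2

for count, avg in report_points([0.9, 0.7, 0.6, 0.5, 0.5, 0.4, 0.4, 0.3]):
    print(count, round(avg, 4))
```

Early in training you get a report after every few examples; by example one million, reports are hundreds of thousands of examples apart.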
Vowpal Wabbit displays two error metrics: the average progressive loss over all examples so far, and the average progressive loss since the last time the diagnostics were printed. You can read some details about the VW diagnostics below:
https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial#vws-diagnostic-information
Best Answer
In general, you calculate the gradient of some error measure (which will most reasonably be some kind of average of per-example errors) with respect to the weights.