Mini-batch gradient descent is implemented basically as you describe in point 2.
- The epoch starts. We sample a minibatch, feed it forward, compute the error, and backpropagate it, i.e. update the weights. We repeat this until we have sampled the full data set. Then the epoch is over.
Assuming that the network is minimizing the following objective function:
$$
\frac{\lambda}{2}||\theta||^2 + \frac{1}{n}\sum_{i=1}^n E(x^{(i)}, y^{(i)}, \theta)
$$
the weight update step is essentially
$$
\theta = (1 - \alpha \lambda) \theta - \alpha \frac{1}{b}\sum_{k=i}^{i+b-1} \frac{\partial E}{\partial \theta}(x^{(k)}, y^{(k)}, \theta)
$$
where the symbols mean:
- $E$ = the error measure (also sometimes denoted as the cost measure $J$)
- $\theta$ = the weights
- $\alpha$ = the learning rate
- $1 - \alpha \lambda$ = the weight decay factor
- $b$ = the batch size
- $x$ = the input variables
You loop over the consecutive batches (i.e. increment by $b$) and update the weights. This more frequent weight updating, combined with vectorization, is what allows mini-batch gradient descent to converge more quickly than either pure batch or stochastic methods.
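The loop over consecutive batches and the update rule above can be sketched in Python. This is a minimal illustration, not the original poster's code; the function name `minibatch_sgd_epoch` and the toy gradient in the usage note are my own choices.

```python
import numpy as np

def minibatch_sgd_epoch(theta, X, Y, grad_E, alpha=0.01, lam=0.0, b=32):
    """One epoch of mini-batch SGD with L2 weight decay.

    theta  : parameter vector
    grad_E : per-example gradient, (x, y, theta) -> dE/dtheta
    alpha  : learning rate; lam : regularization strength; b : batch size
    """
    n = len(X)
    order = np.random.permutation(n)          # shuffle: sample without replacement
    for start in range(0, n, b):              # consecutive batches, increment by b
        batch = order[start:start + b]
        # average the per-example gradients over the mini-batch
        g = np.mean([grad_E(X[k], Y[k], theta) for k in batch], axis=0)
        # (1 - alpha * lam) is the weight decay factor from the update equation
        theta = (1 - alpha * lam) * theta - alpha * g
    return theta
```

For instance, with the quadratic per-example loss $E = \tfrac{1}{2}(\theta - y)^2$, whose gradient is `lambda x, y, th: th - y`, repeated epochs drive `theta` toward the mean of `Y`.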
A good theoretical analysis of with- and without-replacement sampling schemes in the context of iterative algorithms based on random draws (which is how many discriminative Deep Neural Networks (DNNs) are trained) can be found here
In short, it turns out that sampling without replacement leads to faster convergence than sampling with replacement.
I will give a short analysis here based on the toy example that they provide: Let's say that we want to optimize the following objective function:
$$
x_{\text{opt}} = \underset{x}{\arg\min} \frac{1}{2} \sum_{i=1}^{N}(x - y_i)^2
$$
where the targets $y_i \sim \mathcal{N}(\mu, \sigma^2)$. In this example, we are trying to solve for the optimal $x$, given $N$ samples of $y_i$.
Ok, so if we were to solve for the optimal $x$ directly, we would take the derivative of the loss function, set it to 0, and solve for $x$. For our example above, the loss is
$$L = \frac{1}{2} \sum_{i=1}^{N}(x - y_i)^2$$
and its first derivative is:
$$ \frac{\partial L}{\partial x} = \sum_{i=1}^{N}(x - y_i)$$
Setting $\frac{\partial L}{\partial x}$ to 0 and solving for $x$ yields:
$$
x_{\text{opt}} = \frac{1}{N} \sum_{i=1}^{N} y_i
$$
In other words, the optimal solution is nothing but the sample mean of all the $N$ samples of $y$.
Now, if we couldn't perform the above computation all at once, we would have to do it recursively, via the gradient descent update equation below:
$$
x_i = x_{i-1} - \lambda_i \nabla(f(x_{i-1}))
$$
and simply inserting our terms here yields:
$$
x_{i} = x_{i-1} - \lambda_i (x_{i-1} - y_{i})
$$
If we run the above for all $i \in \{1, 2, \ldots, N\}$, then we are effectively performing this update without replacement. The question then becomes: can we also get the optimal value of $x$ in this way? (Remember that the optimal value of $x$ is nothing but the sample mean of $y$.) The answer is yes, if you let $\lambda_i = 1/i$. To see this, we expand:
$$
x_{i} = x_{i-1} - \lambda_i (x_{i-1} - y_{i}) \\
x_{i} = x_{i-1} - \frac{1}{i} (x_{i-1} - y_{i}) \\
x_{i} = \frac{i x_{i-1} - (x_{i-1} - y_{i})}{i} \\
x_{i} = \frac{(i - 1)x_{i-1} + y_{i}}{i} \\
i x_{i} = (i - 1)x_{i-1} + y_{i}
$$
The last equation, however, is nothing but the formula for the running average! Thus as we loop through the set from $i=1$, $i=2$, etc., all the way to $i=N$, we perform our updates without replacement, and our update formula gives us the optimal solution for $x$, which is the sample mean:
$$
N x_{N} = (N - 1)x_{N-1} + y_{N} \implies x_N = \frac{1}{N}\sum_{i=1}^{N} y_i \approx \mu
$$
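The recursion is easy to verify numerically. A minimal sketch (the Gaussian parameters and variable names here are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 2.0, 1.0, 10_000
y = rng.normal(mu, sigma, N)   # N targets drawn from N(mu, sigma^2)

# One pass over the data without replacement, with step size 1/i:
x = 0.0
for i, yi in enumerate(y, start=1):
    x = x - (1.0 / i) * (x - yi)

# x now equals the sample mean of y, up to floating-point rounding
print(x, y.mean())
```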
In contrast, if we actually drew with replacement, then while our draws would be truly independent, the optimized value $x_N$ would differ from the (optimal) mean $\mu$, with expected squared error:
$$
\mathbb{E}\left[(x_N - \mu)^2\right]
$$
which is a positive quantity, and this simple toy example can be extended to higher dimensions. The consequence is that sampling without replacement tends to be the better choice.
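A quick simulation, my own construction rather than part of the cited analysis, illustrates the gap. Here the error is measured against the sample mean of the fixed data set: one pass without replacement lands on the sample mean exactly, while running the same recursion on draws made with replacement leaves a positive average squared error.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 1000)   # the fixed set of N targets

def final_iterate(seq):
    """Run x_i = x_{i-1} - (1/i)(x_{i-1} - y_i) over seq and return x_N."""
    x = 0.0
    for i, yi in enumerate(seq, start=1):
        x = x - (1.0 / i) * (x - yi)
    return x

target = y.mean()
# One shuffled pass = sampling without replacement: recovers the sample mean.
err_without = (final_iterate(rng.permutation(y)) - target) ** 2
# N draws WITH replacement, averaged over many runs: resampling adds variance.
err_with = np.mean([(final_iterate(rng.choice(y, len(y))) - target) ** 2
                    for _ in range(500)])

print(err_without)   # essentially zero
print(err_with)      # clearly positive
```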
Hope this clarifies it some more!
Best Answer
The typical usage of SGD is that each minibatch is constructed completely at random without replacement. (There are other ways to construct minibatches; for some comparison of alternatives, see Why do neural network researchers care about epochs?)
Suppose that you have 6 samples: $S=\{A, B, C, D, E, F\}$ and minibatch size 2. The first minibatch could be $\{A, C\}$ and the second $\{D, F\}$ and the third $\{B, E\}$. So what's happening at the first minibatch is that you're sampling 2 examples without replacement from the set $S$. At the second minibatch, you're sampling 2 examples without replacement from $S\setminus\{A,C\}=\{B,D,E,F\}$. At the third minibatch, you're sampling 2 examples without replacement from $S\setminus\{A,C,D,F\}=\{B,E\}$. At this final step, there is only a single possible minibatch because you have a minibatch size of 2, you're sampling without replacement, and only 2 examples remain.
Now you've exhausted all of your training samples, so the next epoch starts. Epochs are constructed completely at random, so a valid sequence of minibatches is first $\{D, F\}$, second $\{A, B\}$, and third $\{C,E\}$. It's OK that the minibatch $\{D,F\}$ appears in both epoch 1 and epoch 2 -- this happened purely due to randomness. (And in more realistic usage, where more training data is available, it becomes less likely that consecutive epochs contain one or more identical mini-batches, unless the mini-batch size is 1.)
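The epoch construction described above can be sketched as follows; the helper name `epoch_minibatches` is my own, not from the original answer.

```python
import random

def epoch_minibatches(samples, b, rng=random):
    """Split one epoch into mini-batches of size b, without replacement."""
    pool = list(samples)
    rng.shuffle(pool)   # a fresh random order for each epoch
    return [pool[i:i + b] for i in range(0, len(pool), b)]

S = ["A", "B", "C", "D", "E", "F"]
for epoch in range(2):
    # each epoch covers every sample exactly once, in a new random order
    print(epoch_minibatches(S, 2))
```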