Solved – Should training samples for mini-batch neural net training be drawn without replacement?

deep learning, machine learning, neural networks, optimization

We define an epoch as one pass through all of the available training samples, and the mini-batch size as the number of samples over which we average to find the updates to the weights/biases needed to descend the gradient.

My question is whether we should draw without replacement from the set of training examples in order to generate each mini-batch within an epoch. I feel we should avoid replacement so that we actually "draw all the samples" and satisfy the end-of-epoch requirement, but I am having trouble finding a definitive answer one way or the other.
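For concreteness, here is a minimal sketch of the two schemes within a single epoch (a hypothetical NumPy example, not from any particular framework; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, batch_size = 1000, 32

# Without replacement: shuffle the indices once per epoch, then slice them
# into mini-batches, so every sample is visited exactly once per epoch.
perm = rng.permutation(n_samples)
batches_without = [perm[i:i + batch_size]
                   for i in range(0, n_samples, batch_size)]

# With replacement: each mini-batch is an independent draw, so within one
# "epoch" some samples may repeat and others may never appear.
batches_with = [rng.choice(n_samples, size=batch_size, replace=True)
                for _ in range(n_samples // batch_size)]
```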

I've tried googling and reading Ch. 1 of Nielsen's Neural Networks and Deep Learning but have not found a clear answer. In that text Nielsen does not specify that the random sampling be done without replacement, but seems to imply that it is.

A clearer formalization of training in epochs can be found here if desired – https://stats.stackexchange.com/a/141265/131630

Edit: this question seemed similar to me, but it was unclear how to apply the fact that linearity of expectation is indifferent to independence to this situation – Should sampling happen with or without replacement

Best Answer

A good theoretical analysis of with- and without-replacement sampling schemes in the context of iterative algorithms based on random draws (which is how many discriminative Deep Neural Networks (DNNs) are trained) can be found here.

In short, it turns out that sampling without replacement leads to faster convergence than sampling with replacement.

I will give a short analysis here based on the toy example that they provide: Let's say that we want to optimize the following objective function:

$$ x_{\text{opt}} = \underset{x}{\arg\min} \frac{1}{2} \sum_{i=1}^{N}(x - y_i)^2 $$

where the targets $y_i \sim \mathcal{N}(\mu, \sigma^2)$. In this example, we are trying to solve for the optimal $x$ given the $N$ observed labels $y_i$.
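As a concrete setup, here is a small NumPy sketch of this toy problem (the values of $\mu$, $\sigma$, and $N$ are arbitrary illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, N = 3.0, 1.5, 1000            # arbitrary illustrative values
y = rng.normal(mu, sigma, size=N)        # the N labels, y_i ~ N(mu, sigma^2)

def loss(x):
    """The toy objective: 0.5 * sum_i (x - y_i)^2."""
    return 0.5 * np.sum((x - y) ** 2)
```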

OK, so if we were to solve for the optimal $x$ directly, we would take the derivative of the loss function, set it to 0, and solve for $x$. For our example above, the loss is

$$L = \frac{1}{2} \sum_{i=1}^{N}(x - y_i)^2$$

and its first derivative is:

$$ \frac{dL}{dx} = \sum_{i=1}^{N}(x - y_i)$$

Setting $\frac{dL}{dx}$ to 0 and solving for $x$ yields:

$$ x_{\text{opt}} = \frac{1}{N} \sum_{i=1}^{N} y_i $$

In other words, the optimal solution is nothing but the sample mean of all the $N$ samples of $y$.
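Continuing the sketch above, a quick numerical sanity check (my addition, not part of the original answer) confirms that the sample mean is the minimizer:

```python
x_opt = y.mean()                               # closed-form minimizer: the sample mean
print(loss(x_opt))                             # smallest achievable value of the loss
print(loss(x_opt - 0.1), loss(x_opt + 0.1))    # perturbing x either way increases it
```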

Now, if we couldn't perform the above computation all at once, we would have to do it recursively, via the gradient descent update equation below:

$$ x_i = x_{i-1} - \lambda_i \nabla f(x_{i-1}) $$

and simply inserting our terms here yields:

$$ x_{i} = x_{i-1} - \lambda_i (x_{i-1} - y_{i}) $$
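In code, that single-sample update might look like the following (continuing the same hypothetical sketch; the starting value and the name `sgd_single_pass` are my own choices):

```python
def sgd_single_pass(samples, step_sizes):
    """Apply x_i = x_{i-1} - lambda_i * (x_{i-1} - y_i) over the samples in order."""
    x = 0.0                                # arbitrary starting point x_0
    for lam, y_i in zip(step_sizes, samples):
        x = x - lam * (x - y_i)
    return x
```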

If we run the above for all $i \in \{1, 2, \ldots, N\}$, then we are effectively performing this update without replacement. The question then becomes: can we also obtain the optimal value of $x$ this way? (Remember that the optimal value of $x$ is nothing but the sample mean of $y$.) The answer is yes, if you let $\lambda_i = 1/i$. To see this, we expand:

$$
\begin{aligned}
x_{i} &= x_{i-1} - \lambda_i (x_{i-1} - y_{i}) \\
x_{i} &= x_{i-1} - \frac{1}{i} (x_{i-1} - y_{i}) \\
x_{i} &= \frac{i x_{i-1} - (x_{i-1} - y_{i})}{i} \\
x_{i} &= \frac{(i - 1)x_{i-1} + y_{i}}{i} \\
i x_{i} &= (i - 1)x_{i-1} + y_{i}
\end{aligned}
$$

The last equation, however, is nothing but the formula for a running average! Thus, as we loop through the set from $i=1$, $i=2$, and so on up to $i=N$, we perform our updates without replacement, and the update formula gives us the optimal solution for $x$, which is the sample mean!

$$ N x_{N} = (N - 1)x_{N-1} + y_{N} \implies x_N = \frac{1}{N}\sum_{i=1}^{N} y_i = x_{\text{opt}} $$
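Numerically, one pass over the data without replacement with $\lambda_i = 1/i$ recovers the sample mean up to floating-point rounding (continuing the sketch above):

```python
step_sizes = 1.0 / np.arange(1, N + 1)    # lambda_i = 1 / i
x_N = sgd_single_pass(y, step_sizes)      # one pass, each sample used exactly once
print(x_N, y.mean())                      # agree up to floating-point rounding
```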

In contrast, if we actually drew with replacement, then while our draws would be truly independent, the optimized value $x_N$ would in general differ from the optimum $x_{\text{opt}}$ (the sample mean, our estimate of $\mu$), and the expected squared error

$$ \mathbb{E}\big[(x_N - x_{\text{opt}})^2\big] $$

would be strictly positive. This simple toy example can be extended to higher dimensions, and the consequence is that we would want to sample without replacement.
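A small Monte-Carlo check (again my own sketch, not from the cited analysis) illustrates the gap: running the same single pass on indices drawn with replacement leaves a strictly positive mean squared error relative to the sample mean, whereas the without-replacement pass hits it essentially exactly:

```python
errors = []
for _ in range(200):                      # repeat the with-replacement experiment
    idx = rng.choice(N, size=N, replace=True)        # N index draws with replacement
    errors.append((sgd_single_pass(y[idx], step_sizes) - y.mean()) ** 2)

print(np.mean(errors))                    # strictly positive: misses the optimum on average
print((sgd_single_pass(y, step_sizes) - y.mean()) ** 2)   # ~ 0 for the without-replacement pass
```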

Hope this clarifies it some more!