In the mini-batch training of a neural network, I heard that an important practice is to shuffle the training data before every epoch. Can somebody explain why the shuffling at each epoch helps?
From the google search, I found the following answers:
- it helps the training converge fast
- it prevents any bias during the training
- it prevents the model from learning the order of the training
But, I have the difficulty of understanding why any of those effects is caused by the random shuffling. Can anybody provide an intuitive explanation?
Best Answer
Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.
The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.
Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.
But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.
A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_{X_i}(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_{i+1}$, that is, $ℒ_{X_{i+1}}(W_i)$ may take on a very different value from $ℒ_{X_i}(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_{X_{i+1}}$ will -- in general -- be different from that of $ℒ_{X_{i}}$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).
One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).