When should I reset the state of the LSTMs?
Typically, for each new input, i.e. for each sample.
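For instance, in PyTorch (just one possible framework, and the layer sizes below are made up), "resetting" simply means handing the LSTM a fresh zero hidden state for every new sequence instead of carrying the previous one over:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

for seq in dataset:                      # hypothetical iterable; each seq has shape (1, seq_len, 8)
    h0 = torch.zeros(1, 1, 16)           # fresh hidden state ...
    c0 = torch.zeros(1, 1, 16)           # ... and cell state for every sample
    out, (hn, cn) = lstm(seq, (h0, c0))  # hn, cn are NOT fed into the next sample
```

(Omitting the `(h0, c0)` argument has the same effect, since PyTorch defaults the initial state to zeros.)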
How do I feed the network with a mini-batch?
Typically, samples are padded so that all samples in a mini batch have the same length, for programming and performance reasons.
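As a minimal sketch of that padding step (PyTorch assumed; the three sequences are invented):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# invented mini-batch of variable-length sequences, each of shape (seq_len_i, n_features)
sequences = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(7, 8)]
lengths = torch.tensor([s.size(0) for s in sequences])

# pad to the longest sample so the whole batch becomes one rectangular tensor
padded = pad_sequence(sequences, batch_first=True)          # shape (3, 7, 8)

# optionally pack it so the LSTM ignores the padded time steps
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
```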
Batch normalisation is designed specifically to correct for batch-wise effects. You mention grades, which, if they are exam grades, are a classic example of where it may have benefit. Each year a set of students sits the same questions, but the questions differ from year to year. If we assume that, year to year (ignoring long-term variation due to education reforms etc.; that is another ball game), the student population has a consistent IQ, then we expect that differences in raw test scores from year to year reflect differences in question difficulty. Note this is an assumption and as such open to challenge, but it is this assumption that leads to using batch normalisation.
Batch normalising allows correcting for year to year differences in test difficulty.
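A toy sketch of that correction (plain NumPy, invented scores): standardise each year's grades with that year's own mean and standard deviation, so the same standardised score means the same thing across years even if one year's exam was harder.

```python
import numpy as np

# invented raw exam scores: year 2 sat a harder exam, so its raw scores are lower
year_1 = np.array([62., 70., 75., 81., 90.])
year_2 = np.array([48., 55., 60., 66., 74.])

def batch_standardise(x, eps=1e-8):
    # correct each batch (year) using its own mean and standard deviation
    return (x - x.mean()) / (x.std() + eps)

z1 = batch_standardise(year_1)
z2 = batch_standardise(year_2)
# after correction, a student one standard deviation above their cohort's mean
# looks the same in both years, regardless of how hard the exam was
```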
If the batches are distinct, sufficiently large to provide reliable estimates of batch effects, and categorical rather than arbitrary divisions of a continuous process, then the model is being trained on batch-corrected data, and as long as the methodology is consistent it can be applied to new data. We apply the model with the implicit assumption that the batch-wise effects continue to crop up in new years.
Batch normalisation is not a great solution for subsampling a continuous population: while it may often converge over many iterations to typical behaviour, there will be many local inconsistencies. If the data do not fall into clear batches, it would be better to use methods designed for segmenting continuous processes (moving averages, detrending, splines etc.).
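As a rough sketch of that alternative for continuous data (NumPy, synthetic series): estimate the local trend with a moving average and subtract it, rather than forcing the stream into artificial batches.

```python
import numpy as np

# synthetic continuous series: slow drift plus noise
t = np.arange(500)
series = 0.01 * t + np.random.normal(scale=0.5, size=t.size)

# moving average as a local trend estimate, then subtract it (detrending)
window = 25
kernel = np.ones(window) / window
trend = np.convolve(series, kernel, mode="same")
detrended = series - trend
```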
How does this sort of transformation not break down training in its entirety?
If it is appropriate, what will happen is that it corrects irrelevant batch-wise errors, making your data more informative and consistent, which improves training by removing irrelevant noise. If it is not appropriate, it will break training (although it may do so in non-obvious ways, so it may look like it is working during training).
Even though you likely normalized your features beforehand, aren't mini-batches going to end up having significantly differing means and variances that throw everything off?
The point is that batches end up with comparable means and variances after correction. This then makes the batches comparable.
A feature being equal to 0.1 in one batch might mean something completely different than that feature being equal to 0.1 in another.
Yes, this is the point of batch normalisation: to standardise the scales between batches so that we can compare values after correcting for batch-wise effects.
Best Answer
A good theoretical analysis of with- and without-replacement sampling schemes in the context of iterative algorithms based on random draws (which is how many discriminative Deep Neural Networks (DNNs) are trained) can be found here.
In short, it turns out that sampling without replacement leads to faster convergence than sampling with replacement.
I will give a short analysis here based on the toy example that they provide: Let's say that we want to optimize the following objective function:
$$ x_{\text{opt}} = \underset{x}{\arg\min} \frac{1}{2} \sum_{i=1}^{N}(x - y_i)^2 $$
where the target $y_i \sim \mathcal{N}(\mu, \sigma^2)$. In this example, we are trying to solve for the optimal $x$ given the $N$ observed samples $y_i$.
Ok, so if we were to solve for the optimal $x$ in the above directly, then we would take the derivative of the loss function here, set it to 0, and solve for $x$. So for our example above, the loss is
$$L = \frac{1}{2} \sum_{i=1}^{N}(x - y_i)^2$$
and its first derivative would be:
$$ \frac{dL}{dx} = \sum_{i=1}^{N}(x - y_i)$$
Setting $\frac{dL}{dx}$ to 0 and solving for $x$ yields:
$$ x_{\text{opt}} = \frac{1}{N} \sum_{i=1}^{N} y_i $$
In other words, the optimal solution is nothing but the sample mean of all the $N$ samples of $y$.
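A quick numerical sanity check of that closed-form result (NumPy, synthetic data): a brute-force minimisation of the loss lands on the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=1000)        # y_i ~ N(mu, sigma^2)

# brute-force minimisation of L(x) = 0.5 * sum_i (x - y_i)^2 over a grid
grid = np.linspace(y.min(), y.max(), 2001)
losses = 0.5 * ((grid[:, None] - y[None, :]) ** 2).sum(axis=1)
x_opt = grid[losses.argmin()]

print(x_opt, y.mean())   # the two agree, up to the grid resolution
```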
Now, if we couldn't perform the above computation all at once, we would have to do it recursively, via the gradient descent update equation below:
$$ x_i = x_{i-1} - \lambda_i \nabla(f(x_{i-1})) $$
and simply inserting our terms here yields:
$$ x_{i} = x_{i-1} - \lambda_i (x_{i-1} - y_{i}) $$
If we run the above for all $i \in \{1, 2, \ldots, N\}$, then we are effectively performing this update without replacement. The question then becomes: can we also get the optimal value of $x$ in this way? (Remember that the optimal value of $x$ is nothing but the sample mean of $y$.) The answer is yes, if you let $\lambda_i = 1/i$. To see this, we expand:
$$
\begin{aligned}
x_{i} &= x_{i-1} - \lambda_i (x_{i-1} - y_{i}) \\
x_{i} &= x_{i-1} - \frac{1}{i} (x_{i-1} - y_{i}) \\
x_{i} &= \frac{i x_{i-1} - (x_{i-1} - y_{i})}{i} \\
x_{i} &= \frac{(i - 1)x_{i-1} + y_{i}}{i} \\
i x_{i} &= (i - 1)x_{i-1} + y_{i}
\end{aligned}
$$
The last equation however is nothing but the formula for the running average! Thus as we loop through the set from $i=1$, $i=2$, etc, all the way to $i=N$, we would have performed our updates without replacement, and our update formula gives us the optimal solution of $x$, which is the sample mean!
$$ N x_{N} = (N - 1)x_{N-1} + y_{N} \;\Longrightarrow\; x_N = \frac{1}{N}\sum_{i=1}^{N} y_i \approx \mu $$
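A small sketch verifying that recursion numerically (NumPy, synthetic data): one pass over the samples with $\lambda_i = 1/i$ reproduces the sample mean exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=1000)

x = 0.0
for i, y_i in enumerate(y, start=1):
    x = x - (1.0 / i) * (x - y_i)    # x_i = x_{i-1} - (1/i)(x_{i-1} - y_i)

print(x, y.mean())   # identical up to floating-point error: the running mean
```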
In contrast, if we actually drew with replacement, then while our draws would be truly independent, the optimized value $x_N$ would differ from the (optimal) mean $\mu$, and the expected squared error would be given by:
$$ \mathbb{E}\left[(x_N - \mu)^2\right] $$
which will be a positive value; this simple toy example extends to higher dimensions. The consequence is that sampling without replacement is the better choice.
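To make the contrast concrete, here is a toy simulation (NumPy; the sample size and seeds are arbitrary): one sweep through a shuffled copy of the data (without replacement) recovers the sample mean exactly, whereas drawing $N$ samples with replacement generally does not.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=2.0, scale=1.0, size=1000)
target = y.mean()                        # the optimum x_opt from above

def one_pass(samples):
    x = 0.0
    for i, y_i in enumerate(samples, start=1):
        x = x - (1.0 / i) * (x - y_i)    # same lambda_i = 1/i schedule
    return x

x_without = one_pass(rng.permutation(y))                      # every y_i used exactly once
x_with = one_pass(rng.choice(y, size=y.size, replace=True))   # some y_i repeated, some skipped

print((x_without - target) ** 2)   # zero, up to rounding
print((x_with - target) ** 2)      # strictly positive in general
```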
Hope this clarifies it some more!