Solved – Batch processing in a neural network

machine learning, neural networks

I am trying to understand how each batch is processed in a neural network.
I understand that if we have a training set $X=\{x_1,…,x_{|X|}\}$ and we specify a batch size of $n$, then the neural network will process $n$ entries of $X$ at a time: $\{x_1,…,x_n\}$, $\{x_{n+1},…,x_{2n}\}$, …
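In code terms, I picture the slicing like this minimal NumPy sketch (the toy data and variable names are just illustrative):

```python
import numpy as np

X = np.arange(10).reshape(10, 1)   # 10 toy samples, one feature each
y = np.arange(10)                  # the matching targets
n = 3                              # batch size

for start in range(0, len(X), n):
    X_batch = X[start:start + n]   # {x_{start+1}, ..., x_{start+n}}
    y_batch = y[start:start + n]   # the corresponding y values
    print(X_batch.ravel(), y_batch)  # one training iteration would happen here
```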

What I fail to understand is what actually happens at each batch iteration. If I have the vectors $X$ and $Y$ for training, where each $x_i\in X$ has a corresponding expected prediction $y_i\in Y$, and a batch size of $n$, does this mean that the training iteration will associate the sequence $\{x_1,…,x_n\}$ with $y_n$ only? If so, does this mean that the values $\{y_1,…,y_{n-1}\}$ are discarded? Or is the training process associating each $x_i$ with each $y_i$ and then averaging the result over the $n$ samples? If the latter is true, then why is it so much slower for smaller values of $n$?

Best Answer

If a batch consists of $n$ samples, then you use each $x_i$ to predict the corresponding $y_i$, calculate the error to judge how wrong the prediction was, and then aggregate the errors over all the samples (average them).

does this mean that the training iteration will associate the sequence $\{x_1,...,x_n\}$ to $y_n$ only? If so, does this mean that the values $\{y_1,...,y_{n-1}\}$ are dismissed?

No. You cannot use $x_i$ to predict $y_j$ for $i\ne j$. If you have data on car models and want to predict the brand, then information about cars 1, 2, and 3 tells you nothing about the brand of car 4. You use each $x_i$ to predict $y_i$, then use the information about the error you made to update the weights.

Or is the training process associating each $x_i$ to each $y_i$ and then average the result to $n$?

You average the errors, then use the result to update the weights. The model uses the same weights for each sample in the batch: you make all $n$ predictions first, then use the overall (averaged) error to perform a single weight update.
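As a minimal sketch of one such iteration (my own toy example, not from the original answer, assuming a linear model with squared error and NumPy):

```python
import numpy as np

def batch_step(w, X_batch, y_batch, lr=0.1):
    """One training iteration on a single batch of n samples."""
    preds = X_batch @ w                 # predict y_i for every x_i with the SAME weights
    errors = preds - y_batch            # one error per sample; no y_i is discarded
    loss = np.mean(errors ** 2)         # average the per-sample errors
    grad = 2 * X_batch.T @ errors / len(y_batch)  # gradient of the averaged loss
    return w - lr * grad, loss          # one weight update for the whole batch

# toy usage: a batch of 4 samples with 3 features each
rng = np.random.default_rng(0)
w = np.zeros(3)
X_batch = rng.standard_normal((4, 3))
y_batch = rng.standard_normal(4)
w, loss = batch_step(w, X_batch, y_batch)
```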

If the latter is true, then why is it so much slower for smaller values of $n$?

I cannot comment on your particular case, but as for computation time, you can run into two kinds of problems: either the batch is too large, or it is too small. If it is too large, you may not be able to train the model at all, since the batch will not fit in your computer's (or graphics card's) memory. Even if the data and weights do fit, there may be little memory left for the computations themselves, which slows everything down. On the other hand, if the batch is too small, you are wasting resources. High-quality libraries for numerical computation like TensorFlow are designed to carry out mathematical operations efficiently (especially on GPUs), so operations like matrix multiplication are cheap for them. This means that repeating the same operation many times (a for-loop over samples) can be slower than calculating everything all at once, as the sketch below illustrates. That is why choosing an optimal batch size may save you computation time.
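To illustrate the last point (a toy sketch of my own using NumPy on the CPU; the shapes are arbitrary), compare a per-sample loop with a single batched matrix product computing the same result:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 256))   # 10k samples, 256 features each
W = rng.standard_normal((256, 128))      # a weight matrix

t0 = time.perf_counter()
out_loop = np.stack([x @ W for x in X])  # one matrix-vector product per sample
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
out_batch = X @ W                        # one matrix-matrix product for the whole batch
t_batch = time.perf_counter() - t0

assert np.allclose(out_loop, out_batch)  # identical results, very different cost
print(f"loop: {t_loop:.3f}s  batched: {t_batch:.3f}s")
```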