Solved – Understanding batch size in neural networks


I am trying to understand the batch size parameter in neural networks (present in Keras, TensorFlow, etc.). Here is my understanding of how a neural network works. Assume we have an independent vector $x = (x_1, \dots, x_n)$ and a dependent vector $y = (y_1, \dots, y_n)$. We would like to find a function such that:
$$f(w, x_i)= y_i.$$
We do so by trying to find the minimum (using gradient descent with an absolute-value loss function) of the following function

$$\|f(w, x_1) - y_1\| = g_1(w),$$
with respect to $w$, starting at some initial point $w_1$.
The next step is to minimize

$$\|f(w, x_2) - y_2\| = g_2(w),$$
with respect to $w$, starting at the point $w_2 = w_1 + \Delta$.

Processing one observation at a time makes sense to me, but what is the benefit of batch processing? Assume we want to process $x_1$ and $x_2$ at once. Do we evaluate both at the same initial point $w_1$? If so, we end up with two updates, $\Delta_1$ and $\Delta_2$. Do we apply these deltas to $w_1$ consecutively, or do we average them? What am I missing? I am guessing that batch processing has something to do with the programming implementation, but how does it link back to the math above?
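To make my setup concrete, here is a minimal sketch of the per-sample procedure I describe above. It assumes, purely for illustration, a linear model $f(w, x) = w \cdot x$ and the subgradient $\operatorname{sign}(w \cdot x - y)\,x$ of the absolute-value loss; the learning rate `lr` is arbitrary.

```python
import numpy as np

def grad_single(w, x, y):
    """Subgradient of g_i(w) = |w @ x - y| with respect to w
    (assumes the illustrative linear model f(w, x) = w @ x)."""
    return np.sign(w @ x - y) * x

def one_at_a_time(w, xs, ys, lr=0.01):
    """Process each observation in turn: the step for x_2 starts
    from the updated point w_2 = w_1 + Delta produced by x_1."""
    for x, y in zip(xs, ys):
        w = w - lr * grad_single(w, x, y)  # one delta per observation
    return w
```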

Best Answer

In batch processing, the gradient is evaluated for several different input/output pairs, with each observation yielding a different gradient vector. We then average the gradient vectors over all observations in the batch and take a single step in the resulting direction.
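As a rough sketch, reusing the illustrative linear model and absolute-value subgradient from the question (assumptions of this example, not anything specific to Keras or TensorFlow), a mini-batch step looks like this:

```python
import numpy as np

def minibatch_step(w, X, Y, lr=0.01):
    """One mini-batch step: evaluate the gradient at the SAME point w
    for every observation in the batch, average the resulting vectors,
    and take a single step in the averaged direction."""
    grads = [np.sign(w @ x - y) * x for x, y in zip(X, Y)]
    return w - lr * np.mean(grads, axis=0)
```

So, to answer the question directly: both gradients are evaluated at the same point $w_1$, and the deltas are averaged rather than applied consecutively.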

The benefit of this is that a gradient estimate based on a single point can be very noisy. By averaging the gradients over many samples, we obtain a less noisy estimate of the true gradient.

I'll also point out that too large a batch can be a bad thing: sometimes the noise in the gradient direction can kick you out of a local minimum and help you reach a better solution. This is problem-specific, though, so when training a neural network you should experiment with several different batch sizes. This paper goes into further detail.
