Solved – Backward propagation without a loop over training examples

backpropagation, machine learning, MATLAB, neural networks

At this point in a Coursera Machine Learning lecture on implementing neural networks, Andrew Ng says:

To implement back-prop, usually we will do that with a for-loop over the training examples.

Some of you may have heard of frankly very advanced vectorization methods where you don't have a for-loop over the $m$ training examples.

The language used in the course is Octave / MATLAB.

How does one implement vectorised back-prop without a loop over the training examples?

Best Answer

Reading the quote, it's not clear to me what exactly he means by 'loop'. But, we can consider a couple of possibilities. As an example, say we want to compute the dot product between two vectors, $x$ and $y$. A naive way to do this would be to write a loop in some high-level programming language to manually add up the products of the elements. In Octave:

result = 0;
for i = 1:n
    result = result + x(i) * y(i);
end

A more efficient approach would be to use a numerical linear algebra library that implements dot products. The code might look something like:

result = dot(x, y)

In the context of high-level, interpreted languages like Python, MATLAB, etc., this would be called 'vectorized code'. The code we've written no longer contains a loop, although looping might still be happening at a lower level (I'll get to this in a second). In an interpreted language, the naive code would cause the interpreter to repeatedly execute the instructions in the loop. To do its job, the interpreter must perform additional computations beyond the mathematical operations we're interested in. Doing this repeatedly over the loop incurs a lot of overhead (but a smarter interpreter might use a JIT compiler to avoid this). In the vectorized code, the interpreter would pass the entire operation to the linear algebra library, which can execute it much more efficiently.

In compiled languages like C, the high-level code would be compiled to machine code, avoiding the overhead of an interpreter. But using a numerical linear algebra library (e.g. BLAS) will still be more efficient than the naive code because it's highly optimized and can accelerate the computation using special features of the hardware.
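To make the contrast concrete, here's a small sketch in Python with NumPy (chosen only for illustration; the same idea applies to Octave's built-in `dot`). Both snippets compute the same value, but the vectorized call hands the whole operation to optimized library code instead of stepping through the interpreter one element at a time:

```python
import numpy as np

n = 100_000
rng = np.random.default_rng(0)
x = rng.random(n)
y = rng.random(n)

# Naive version: the interpreter executes each iteration separately,
# paying interpreter overhead on every pass through the loop body.
result_loop = 0.0
for i in range(n):
    result_loop += x[i] * y[i]

# Vectorized version: one call passes the entire operation to the
# underlying linear algebra routines (BLAS), which are highly optimized.
result_vec = np.dot(x, y)
```

The two results agree up to floating-point rounding; the difference is purely in how the work is dispatched.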

The vectorized code doesn't explicitly contain loops. But, as Matthew Drury pointed out in the comments, looping might still be happening at a lower level (e.g. as the computer steps through the operations in the linear algebra library's machine code). The loop(s) won't look exactly like the naive code because much of the efficiency of numerical computing libraries comes from executing multiple operations simultaneously (i.e. parallelism). This can be achieved by taking advantage of special CPU instructions, using multiple CPU cores, or even multiple networked machines or specialized hardware.

At a fundamental level, we can consider a loop to be the repeated, sequential execution of identical operations in time. So, is it possible to perform the computation without any loops whatsoever? In principle, the answer is yes (for some problems), by using parallelism and executing all operations simultaneously. Let's ignore the specifics of existing hardware. In the dot product example, imagine we made a custom piece of hardware (e.g. an analog or digital circuit) with the following logical structure. It would compute dot products without any loops taking place:

(Diagram: the logical structure of such a circuit, with all multiplications performed simultaneously and the products combined by a tree of adders.)
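The structure of such a circuit can be mimicked in software as a binary reduction tree (sketched here in NumPy for illustration): all $n$ products are formed in one step, then partial sums are combined pairwise, so with enough parallel adders only about $\log_2 n$ sequential steps remain.

```python
import numpy as np

def tree_dot(x, y):
    """Dot product via a binary reduction tree.

    The n multiplications form one (conceptually parallel) step; each
    subsequent level sums adjacent pairs, halving the number of terms.
    """
    p = x * y                  # all n products: one parallel step
    while p.size > 1:
        if p.size % 2:         # pad with a zero when the length is odd
            p = np.append(p, 0.0)
        p = p[0::2] + p[1::2]  # one tree level: pairwise partial sums
    return p[0]
```

On a sequential CPU the while-loop still executes level by level, of course; the point is that each level's additions are independent and could all happen at once in hardware.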

Not all computations can be parallelized. For example, sometimes a later part of the computation depends on an earlier result. Some problems can be broken into smaller pieces that can be parallelized. Getting back to the original question about backprop, the gradient is a sum of terms computed for each training point. These terms don't depend on each other, so they can be computed in parallel. If the number of terms exceeds the number of processing units, some amount of looping would be required, but fewer iterations of the loop would be needed.
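As a concrete sketch of this idea, here's a backward pass for a one-hidden-layer sigmoid network with no loop over the $m$ training examples (written in NumPy for illustration; the same matrix expressions translate directly to Octave/MATLAB). Biases are omitted for brevity, and the output error term `D2 = A2 - Y` assumes a cross-entropy cost, as in the course. Each column of the intermediate matrices holds one training example, and the matrix products at the end perform the per-example sum that a for-loop would otherwise accumulate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_vectorized(X, Y, W1, W2):
    """Gradients for a one-hidden-layer sigmoid network, all m examples
    at once. X is (n, m) inputs, Y is (k, m) targets, one column each."""
    m = X.shape[1]
    # Forward pass for all examples simultaneously.
    Z1 = W1 @ X                       # (h, m)
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1                      # (k, m)
    A2 = sigmoid(Z2)
    # Backward pass: each column is one example's error term.
    D2 = A2 - Y                       # (k, m), cross-entropy output error
    D1 = (W2.T @ D2) * A1 * (1 - A1)  # (h, m)
    # The matrix products below sum over the m columns, replacing the
    # per-example for-loop and its running accumulation.
    grad_W2 = (D2 @ A1.T) / m
    grad_W1 = (D1 @ X.T) / m
    return grad_W1, grad_W2
```

Checking the result against an explicit loop over the columns of `X` confirms that the two formulations compute the same gradients.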

In practice, high performance neural nets are often implemented using libraries like Theano and TensorFlow. These libraries allow one to describe the structure and operation of the network at a somewhat abstract level. They compile these descriptions into efficient code in a way that takes care of details like computing gradients (see automatic differentiation). They can achieve parallelism in multiple ways, including using specialized CPU instructions, multiple CPU cores, multiple networked machines, and GPUs. A current research topic and emerging trend in industry is to accelerate neural nets using custom hardware.