[Math] How to derive the back propagation formula in a more elegant way

clifford-algebras, machine-learning, neural-networks, tensors

When you compute the gradient of the cost function of a neural network with respect to its weights, as I currently understand it, you can only do so by computing the partial derivative of the cost function with respect to each weight individually and then collecting them in a big vector. The reason I'm computing the gradient is that I want to derive the backpropagation formula.

Specifically, if $J$ is the cost function of the neural network, and $\Theta^{(l)}_{ij}$ is the weight that the activation of neuron $i$ in layer $l-1$ has in computing the activation of neuron $j$ in layer $l$, then as I currently understand it, in order to compute the gradient of the cost function you have to compute $\frac{\partial J}{\partial \Theta^{(l)}_{ij}}$ for each value of $i$, $j$ and $l$. However, in doing this you have to keep track of these three indices, and I find this inelegant enough that I'm willing to ask the following question.

Is there a way of deriving the backpropagation formula that allows you to keep track of the neural network weights as a single object?

For example, with logistic regression, you have a vector of weights $\mathbf{w}$ and the maximum likelihood cost function is:

\begin{equation}
J(\mathbf{w}) = \frac{1}{N}\sum_{i = 1}^{N}\ln\left(1 + e^{-y_i\mathbf{w}^{T}\mathbf{x}_i}\right)
\end{equation}

and you can compute the gradient relatively straightforwardly:
\begin{align*}
\nabla_{\mathbf{w}} J &= \frac{1}{N} \sum_{i = 1}^{N}\frac{-y_i\mathbf{x}_i}{1 + e^{y_i\mathbf{w}^{T}\mathbf{x}_i}}
\end{align*}

You don't need an extra index for each component of the weights $\mathbf{w}$ because you are putting them in a single vector. I would like to derive the backpropagation formula in a way that, similar to the above derivation of the gradient of the logistic regression cost function, doesn't involve indices for each component of the neural network weights. How can I do this? I think that frameworks such as geometric algebra and tensors might be helpful here, which is why I've included them in the tags.
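For concreteness, here is a minimal NumPy sketch of that vectorized gradient (the function and variable names are my own, purely illustrative); note that there is no loop over individual components of $\mathbf{w}$:

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    """Gradient of J(w) = (1/N) * sum_i ln(1 + exp(-y_i * w^T x_i)).
    X: (N, d) inputs, y: (N,) labels in {-1, +1}, w: (d,) weights."""
    margins = y * (X @ w)                        # y_i * w^T x_i for every sample
    coeff = -y / (1.0 + np.exp(margins))         # per-sample scalar factor
    return (coeff[:, None] * X).mean(axis=0)     # average of coeff_i * x_i

# Tiny illustrative call (the data is made up):
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, 1, 1, -1])
print(logistic_loss_grad(np.zeros(3), X, y))
```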

Best Answer

As much as I love Clifford / geometric algebra, I don't think it is of use here. CA allows you to deal in a basis-free way with multidimensional subspaces as algebraic objects, making various complicated derivations more elegant and transparent. In the case of neural nets, however, we really are dealing with a plain vector of parameters. If $\theta \in \mathbb{R}^D$ is a list of parameters, it is indeed a vector because the sum of two parameter vectors is again a valid parameter vector and a scalar times a parameter vector is also a valid parameter vector. This vector does not naturally possess any additional structure that would make it into a multivector.

However, standard Linear Algebra, as practiced by mathematicians, does allow you to do basis-free computations with vectors. This can indeed make the derivation of backpropagation more transparent.

The essence of backprop

Basically, the backpropagation algorithm is an efficient way to compute the gradient of a function that is the composition of a number of functions: $$ L_\theta(x) = (f_N \circ f_{N-1} \circ \cdots \circ f_1)(x)$$ $L$ is called the loss, and it is just a chain of functions $f_i : \mathbb{R}^{D_{i-1}} \times \mathbb{R}^{P_i} \rightarrow \mathbb{R}^{D_i}$, each taking a $P_i$-dimensional parameter vector $\theta_i \in \mathbb{R}^{P_i}$ and the output $f_{i-1} \in \mathbb{R}^{D_{i-1}}$ of the previous layer.
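To make the setup concrete, here is a minimal Python/NumPy sketch of such a chain with two layers; the layer widths, activation choices and names are purely illustrative assumptions, not part of the argument:

```python
import numpy as np

rng = np.random.default_rng(0)
D0, D1 = 4, 3                                    # illustrative layer widths

# Each f_i takes the previous layer's output and its own parameter block
# theta_i, matching the signature f_i : R^{D_{i-1}} x R^{P_i} -> R^{D_i}.
def f1(x, theta1):
    W, b = theta1
    return np.tanh(W @ x + b)                    # R^{D0} -> R^{D1}

def f2(h, theta2):
    w, b = theta2
    return np.log1p(np.exp(-(w @ h + b)))        # R^{D1} -> R (scalar loss)

def L(x, theta):
    theta1, theta2 = theta
    return f2(f1(x, theta1), theta2)             # the composition f2 ∘ f1

theta = ((rng.normal(size=(D1, D0)), rng.normal(size=D1)),
         (rng.normal(size=D1), rng.normal()))
x = rng.normal(size=D0)
print(L(x, theta))                               # a single scalar loss value
```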

Without choosing a basis, we see (using the total derivative) that the gradient of $L$ wrt the whole parameter vector $\theta$ is

$$ \begin{equation} \begin{aligned} \frac{d}{d \theta} f_N \circ \cdots \circ f_1 (x) \; &= \frac{\partial f_N}{\partial \theta_N} \frac{d \theta_N}{d \theta} + \frac{\partial f_N}{\partial f_{N-1}} \frac{d f_{N-1}}{d \theta} \\ &= \frac{\partial f_N}{\partial \theta_N} \frac{d \theta_N}{d \theta} + \frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial \theta_{N-1}} \frac{d \theta_{N-1}}{d\theta} + \frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}} \frac{d f_{N-2}}{d\theta}\\ &= \cdots \end{aligned} \end{equation} $$ Do you see the pattern developing? You get products of the form $\frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}} \cdots \frac{\partial f_{2}}{\partial f_1}$. Each factor in that product requires the output of the previous function (layer), so a dumb algorithm would recompute $f_i$ for each term that requires it. A better approach is to do a "forward pass" where we compute $f_1(x)$, remember it, then use it to compute $f_2(f_1(x))$, remember it, and so on.
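A sketch of that forward pass under the same illustrative setup (the helper name `forward` is mine): we cache every intermediate output so the backward pass never recomputes an $f_i$.

```python
def forward(x, thetas, layers):
    """layers: list of functions f_i(input, theta_i); thetas: matching parameter blocks.
    Returns every intermediate output, cached so nothing is recomputed later."""
    outputs = [x]
    for f, theta in zip(layers, thetas):
        outputs.append(f(outputs[-1], theta))    # f_i applied to the previous layer's output
    return outputs                               # outputs[i] = (f_i ∘ ... ∘ f_1)(x), outputs[0] = x
```

With the hypothetical `f1`, `f2` from above, `forward(x, [theta1, theta2], [f1, f2])` returns `[x, f1(x, theta1), L(x, theta)]`.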

Then there is the question of how to compute $\frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}} \cdots \frac{\partial f_{2}}{\partial f_1}$ once we have all the values of $f_i$. Once we choose a basis (and we eventually have to in order to get numerical values), the factors $\frac{\partial f_{i}}{\partial f_{i-1}}$ are matrices, so we're talking about a big matrix multiplication. For example, if a mid-layer maps $\mathbb{R}^{1000}$ to $\mathbb{R}^{5000}$, then $\frac{\partial f_{i}}{\partial f_{i-1}}$ is a $5000 \times 1000$ matrix. The key insight of backprop is that the whole function $L$ has signature $L : \mathbb{R}^{D_0} \rightarrow \mathbb{R}$, so it's much better to compute $$ \left(\left(\left( \frac{\partial f_N}{\partial f_{N-1}} \cdots \frac{\partial f_{3}}{\partial f_2} \right) \frac{\partial f_{2}}{\partial f_1} \right) \frac{d f_1}{d \theta}\right) $$ than $$ \left(\frac{\partial f_N}{\partial f_{N-1}} \cdots \left(\frac{\partial f_{3}}{\partial f_2} \left(\frac{\partial f_{2}}{\partial f_1} \frac{d f_1}{d \theta}\right)\right)\right) $$
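To see the two bracketings side by side, here is a small NumPy illustration (the sizes and the Jacobians are made up; in a real network they would come from the layers themselves):

```python
import numpy as np

D = 1000                                         # illustrative layer width
rng = np.random.default_rng(0)

# Hypothetical Jacobians of three mid-layers (D x D each) and the row
# vector dL/df_3 of shape (1, D), which exists because L is scalar-valued.
J1, J2, J3 = (rng.normal(size=(D, D)) for _ in range(3))
g = rng.normal(size=(1, D))

# First bracketing: left to right, every step is a vector-matrix product
# and the intermediate result stays a (1, D) row vector.
left_to_right = ((g @ J3) @ J2) @ J1

# Second bracketing: right to left, the steps are full D x D
# matrix-matrix products before the row vector ever gets involved.
right_to_left = g @ (J3 @ (J2 @ J1))

assert np.allclose(left_to_right, right_to_left)
```

The left-to-right order is exactly what the backward pass does: it only ever propagates a row vector.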

The reason is that in the first case, we perform a sequence of vector-matrix multiplications. The result of the vector-matrix product $\frac{\partial f_{N}}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}}$ is again a (row) vector, which we then multiply by $\frac{\partial f_{N-2}}{\partial f_{N-3}}$, and so on. The cost of each of these multiplications is $O(D^2)$. In the second case, one computes matrix-matrix products, each of which costs $O(D^3)$.
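To put illustrative numbers on that: with $D = 1000$ and ten layers of that width, the vector-matrix route costs roughly $10 \cdot D^2 = 10^7$ multiply-adds, while the matrix-matrix route costs roughly $10 \cdot D^3 = 10^{10}$, i.e. a factor of $D$ more.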

So we have established that backprop is the exploitation of associativity. In this light backprop seems like a trivial algorithm (compute the product in the right order...), but back in the '80s you could write a paper about it and get very famous :)
