[Math] Reverse mode differentiation vs. forward mode differentiation – where are the benefits


According to Wikipedia, forward mode differentiation is preferred when $f: \mathbb{R}^n \to \mathbb{R}^m$ with $m \gg n$. I cannot see any computational benefit. Let us take a simple example: $f(x,y) = \sin(xy)$. We can visualize it as a graph with four nodes and three edges. The top node is $\sin(xy)$, the node one level below is $xy$, and the two initial nodes are $x$ and $y$. The derivatives on the edges are $\cos(xy)$, $x$, and $y$. For both reverse and forward mode differentiation we have to compute these derivatives. How is reverse mode differentiation computationally superior here?
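For concreteness, here is a minimal sketch in JAX (an illustrative choice of tool, not part of the original question) showing both modes on this function; each evaluates the same local derivatives and differs only in how it combines them:

```python
# f(x, y) = sin(x*y): forward mode pushes a tangent from the inputs to the
# output, reverse mode pulls a cotangent from the output back to the inputs.
import jax
import jax.numpy as jnp

def f(x, y):
    return jnp.sin(x * y)

x, y = 0.5, 2.0

# Forward mode: one pass per input direction; here the direction (1, 0).
_, df_dx = jax.jvp(f, (x, y), (1.0, 0.0))

# Reverse mode: one pass per output; pull the cotangent 1.0 backwards.
_, pullback = jax.vjp(f, x, y)
grad_x, grad_y = pullback(1.0)

print(df_dx, grad_x)  # both equal y * cos(x*y)
```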

Best Answer

The chain rule states that to compute the Jacobian of an operation we should multiply the Jacobians of all sub-operations together. The difference between forward and reverse differentiation is the order in which we multiply those Jacobians.

In your case you only have two sub-operations, $xy$ and $\sin()$, leading to only one matrix multiplication, so it isn't really instructive. However, let's consider an operation with three sub-operations. Take the function $$ \mathbf{y} = h(g(f(\mathbf{x}))) $$ where $\mathbf{x}$ and $\mathbf{y}$ are vectors of different lengths. We can break this down into $$ \mathbf{a} = f(\mathbf{x}),~~~~ \mathbf{b} = g(\mathbf{a}),~~~~ \mathbf{y} = h(\mathbf{b}). $$

This gives us the Jacobian $$ \underbrace{\frac{\partial \mathbf{y}}{\partial \mathbf{x}}}_{|\mathbf{y}|\times|\mathbf{x}|} = \underbrace{\frac{\partial h(\mathbf{b})}{\partial \mathbf{b}}}_{|\mathbf{y}|\times|\mathbf{b}|} \underbrace{\frac{\partial g(\mathbf{a})}{\partial \mathbf{a}}}_{|\mathbf{b}|\times|\mathbf{a}|} \underbrace{\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}}_{|\mathbf{a}|\times|\mathbf{x}|}, $$ with the size of each matrix noted below it. The time taken to compute each of those intermediate Jacobians is fixed, but the order in which we multiply them changes the number of operations required.

Forward differentiation would compute $$ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial h(\mathbf{b})}{\partial \mathbf{b}}\left(\frac{\partial g(\mathbf{a})}{\partial \mathbf{a}}\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right), $$ which involves $|\mathbf{x}|\cdot|\mathbf{a}|\cdot|\mathbf{b}|+|\mathbf{x}|\cdot|\mathbf{b}|\cdot|\mathbf{y}|$ multiplications*. In contrast, reverse differentiation would compute $$ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \left(\frac{\partial h(\mathbf{b})}{\partial \mathbf{b}}\frac{\partial g(\mathbf{a})}{\partial \mathbf{a}}\right)\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}, $$ which involves $|\mathbf{y}|\cdot|\mathbf{a}|\cdot|\mathbf{b}|+|\mathbf{y}|\cdot|\mathbf{a}|\cdot|\mathbf{x}|$ multiplications.
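As a concrete illustration (a sketch with made-up layer sizes, not part of the original answer), JAX exposes exactly these two orderings: `jax.jacfwd` builds the Jacobian from one JVP per input dimension (forward mode), while `jax.jacrev` builds it from one VJP per output dimension (reverse mode):

```python
# Sketch: compare forward- and reverse-mode Jacobians of a 3-stage chain.
# Layer sizes are illustrative: |x| = 3, |a| = |b| = 20, |y| = 50.
import jax
import jax.numpy as jnp

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
A = jax.random.normal(k1, (20, 3))    # f: R^3  -> R^20
B = jax.random.normal(k2, (20, 20))   # g: R^20 -> R^20
C = jax.random.normal(k3, (50, 20))   # h: R^20 -> R^50

def chain(x):                         # y = h(g(f(x)))
    return jnp.tanh(C @ jnp.tanh(B @ jnp.tanh(A @ x)))

x = jnp.ones(3)
J_fwd = jax.jacfwd(chain)(x)          # 3 JVP passes, one per input
J_rev = jax.jacrev(chain)(x)          # 50 VJP passes, one per output
print(J_fwd.shape, jnp.allclose(J_fwd, J_rev, atol=1e-5))  # (50, 3) True
```

Since $|\mathbf{y}| > |\mathbf{x}|$ here, forward mode is the cheaper of the two, matching the operation counts above.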

Assuming for simplicity that $|\mathbf{a}|=|\mathbf{b}|$, in the case that $|\mathbf{y}|\gt|\mathbf{x}|$, we can see that forward differentiation results in fewer operations. Similarly, if $|\mathbf{y}|\lt|\mathbf{x}|$ then reverse differentiation results in fewer operations.
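To put concrete (illustrative) numbers on this: with $|\mathbf{x}|=1$, $|\mathbf{a}|=|\mathbf{b}|=100$ and $|\mathbf{y}|=1000$, forward differentiation needs $1\cdot100\cdot100+1\cdot100\cdot1000 = 110{,}000$ multiplications while reverse needs $1000\cdot100\cdot100+1000\cdot100\cdot1 = 10{,}100{,}000$, roughly a $90\times$ difference in favour of forward mode. Swapping $|\mathbf{x}|$ and $|\mathbf{y}|$ reverses the conclusion.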

Extending this reasoning to a longer chain of Jacobians, we can see that the optimal ordering of the matrix multiplications need not be fully forward or fully reverse differentiation: hybrid schemes can result in even fewer operations.
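Finding that optimal ordering is exactly the classic matrix-chain multiplication problem, which can be solved by dynamic programming. A minimal Python sketch (the dimension list is an illustrative assumption, ordered as $[|\mathbf{y}|, |\mathbf{b}|, |\mathbf{a}|, |\mathbf{x}|]$):

```python
# Matrix-chain ordering: dims[i] x dims[i+1] is the size of the i-th Jacobian
# factor, read left to right; find the cheapest parenthesization.
from functools import lru_cache

def min_multiplications(dims):
    n = len(dims) - 1  # number of Jacobian factors in the chain

    @lru_cache(maxsize=None)
    def cost(i, j):  # cheapest way to multiply factors i..j inclusive
        if i == j:
            return 0
        return min(
            cost(i, k) + cost(k + 1, j) + dims[i] * dims[k + 1] * dims[j + 1]
            for k in range(i, j)
        )

    return cost(0, n - 1)

# Illustrative sizes |y|=10, |b|=|a|=100, |x|=1000:
print(min_multiplications([10, 100, 100, 1000]))  # 1100000 (reverse order wins)
```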


*The number of scalar multiplications required to multiply two matrices of sizes $a\times b$ and $b\times c$ is $a\cdot b\cdot c$.