[Math] chain rule using tree diagram, why does it work

calculus, functions, graph theory

In multivariable calculus, I was taught to compute the chain rule by drawing a "tree diagram" (a directed acyclic graph) representing the dependence of one variable on the others. I now want to understand the theory behind it.

Example:
Let $y$ and $x$ both be functions of $t$.
Let $z$ be a function of both $x$ and $y$.

The derivative of $z$ with respect to $t$ is:
$$\frac{dz}{dt} = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt}$$

To compute this derivative, I was taught to draw a graph with the following edges:
$x \to z$, $y \to z$, $t \to x$, and $t \to y$.
Source: http://www.math.hmc.edu/calculus/tutorials/multichainrule/


These tree diagrams can be constructed for arbitrarily complex functions with many variables.

In general, to find the derivative of a dependent variable with respect to an independent variable, you sum over all of the distinct paths from the independent variable to the dependent variable. Along each path, you multiply the derivatives attached to its edges (e.g. $\frac{\partial z}{\partial x} \cdot \frac{dx}{dt}$).

Why does this work?

Best Answer

The point of derivatives in one variable is to provide linear approximations $f(x) = f(p) + f'(p) (x - p) + o(|x - p|)$ to nice functions. Multivariate derivatives work the same way, except "linear approximation" here means approximation by a general linear transformation (a matrix) instead of a scalar.

This is made precise by the following definition: we say that a function $f : \mathbb{R}^n \to \mathbb{R}^m$ has total derivative a linear transformation $df_p : \mathbb{R}^n \to \mathbb{R}^m$ at a point $p$ if there exist $\epsilon > 0$ and a function $E_p(h)$ defined for $|h| < \epsilon$ such that

$$f(p + h) = f(p) + df_p(h) + |h| E_p(h)$$

where $\lim_{h \to 0} E_p(h) = 0$. The matrix $df_p$ is sometimes called the Jacobian. In little-o notation, we write this as

$$f(p + h) = f(p) + df_p(h) + o(|h|).$$
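As a rough numerical illustration of this definition, the sketch below checks that the remainder divided by $|h|$ shrinks as $h \to 0$. The map `f`, the point `p`, and the direction of `h` are hypothetical choices made only for this example, and the derivative `df` is worked out by hand for that particular `f`.

```python
import math

def f(x, y):
    # Hypothetical example map R^2 -> R^2, chosen only for illustration.
    return (x * y, math.sin(x) + y * y)

def df(p, h):
    # Total derivative of f at p applied to h (Jacobian times h), computed by hand.
    x, y = p
    hx, hy = h
    return (y * hx + x * hy, math.cos(x) * hx + 2 * y * hy)

def error_ratio(p, h):
    # |f(p + h) - f(p) - df_p(h)| / |h|, which is |E_p(h)| in the definition.
    fp = f(*p)
    fph = f(p[0] + h[0], p[1] + h[1])
    lin = df(p, h)
    rem = math.hypot(fph[0] - fp[0] - lin[0], fph[1] - fp[1] - lin[1])
    return rem / math.hypot(h[0], h[1])

p = (0.7, -1.2)
# Shrink h along a fixed unit direction (0.6, -0.8).
ratios = [error_ratio(p, (s * 0.6, s * -0.8)) for s in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

For a smooth map like this one the ratios should fall roughly in proportion to $|h|$, which is stronger than the definition requires; the definition only asks that they tend to $0$.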

This might seem unnecessarily complicated, but it is the key to understanding the multivariate chain rule. Suppose that in addition to $f$ we have another function $g : \mathbb{R}^m \to \mathbb{R}^k$ with a total derivative $dg_q$ at some point $q$, and suppose that $f(p) = q$. Then

$$gf(p + h) = g \left( f(p) + df_p(h) + o(|h|) \right) = gf(p) + dg_q df_p(h) + o(|h|)$$

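To see the second equality, write $k = df_p(h) + o(|h|)$, so that $|k| = O(|h|)$, and apply the definition to $g$ at $q$: $g(q + k) = g(q) + dg_q(k) + o(|k|)$. Since $dg_q$ is linear, $dg_q(k) = dg_q df_p(h) + o(|h|)$. In other words,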

The total derivative $d(gf)_p$ of $gf$ at $p$ is the (matrix) product of the total derivatives $dg_q$ and $df_p$.
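This statement can be sanity-checked numerically. The sketch below is only an illustration: `f` and `g` are hypothetical maps picked for the example, and the Jacobians are approximated by central finite differences rather than computed exactly.

```python
import math

def f(p):
    # Hypothetical f : R^2 -> R^2.
    x, y = p
    return [x * y, math.sin(x) + y * y]

def g(q):
    # Hypothetical g : R^2 -> R^2.
    u, v = q
    return [u + v * v, math.exp(u) * v]

def jacobian(func, p, eps=1e-6):
    # Central finite-difference approximation of the total derivative at p.
    # Rows index outputs, columns index inputs.
    m = len(func(p))
    J = [[0.0] * len(p) for _ in range(m)]
    for j in range(len(p)):
        plus = list(p)
        plus[j] += eps
        minus = list(p)
        minus[j] -= eps
        f_plus, f_minus = func(plus), func(minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

p = [0.7, -1.2]
q = f(p)
lhs = jacobian(lambda s: g(f(s)), p)            # derivative of the composition at p
rhs = matmul(jacobian(g, q), jacobian(f, p))    # dg_q times df_p
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(max_gap)
```

The two matrices should agree up to finite-difference error. Note the order of the product: $dg_q$ on the left, $df_p$ on the right, with $dg$ evaluated at $q = f(p)$ rather than at $p$.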

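This is the most general statement of the multivariate chain rule. The relationship to tree diagrams is that one can model matrix multiplication using composition of incidence matrices, which come from graphs depicting incidence relationships between sets. Concretely, the $(i, j)$ entry of a matrix product $AB$ is $\sum_k A_{ik} B_{kj}$, which is a sum over the two-step paths $j \to k \to i$ of the product of the weights on the two edges.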
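The link between matrix products and paths in a graph can be checked directly. In the sketch below, the node names and edge weights are hypothetical; the code enumerates paths recursively from an edge list and compares the resulting path sums with the ordinary matrix product of the two weight matrices.

```python
# Hypothetical weighted edges of a layered graph: inputs s0, s1; intermediates
# m0, m1, m2; outputs o0, o1. A missing edge (here m0 -> o1) counts as weight 0.
edges = {
    ("s0", "m0"): 2.0, ("s1", "m0"): -1.0,
    ("s0", "m1"): 0.5, ("s1", "m1"): 3.0,
    ("s0", "m2"): 4.0, ("s1", "m2"): 1.0,
    ("m0", "o0"): 1.0, ("m1", "o0"): 2.0, ("m2", "o0"): -3.0,
    ("m1", "o1"): 5.0, ("m2", "o1"): 1.0,
}
inputs, mids, outputs = ["s0", "s1"], ["m0", "m1", "m2"], ["o0", "o1"]

def path_sum(src, dst):
    # Sum, over every directed path from src to dst, of the product of edge weights.
    if src == dst:
        return 1.0
    return sum(w * path_sum(nxt, dst) for (a, nxt), w in edges.items() if a == src)

def weight_matrix(sources, targets):
    # Rows index targets, columns index sources, matching the Jacobian layout.
    return [[edges.get((s, t), 0.0) for s in sources] for t in targets]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

AB = matmul(weight_matrix(mids, outputs), weight_matrix(inputs, mids))
path_table = [[path_sum(s, o) for s in inputs] for o in outputs]
print(AB)
print(path_table)
```

If each edge weight is taken to be the partial derivative of the target variable with respect to the source variable, the weight matrices become Jacobians and the path sum is the tree-diagram rule.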

In your particular example, you have a function $t \mapsto (x, y) : \mathbb{R}^1 \to \mathbb{R}^2$ and another function $(x, y) \mapsto z : \mathbb{R}^2 \to \mathbb{R}^1$. The total derivative of the first function is $\left[ \begin{array}{c} \frac{dx}{dt} \\ \frac{dy}{dt} \end{array} \right]$ and the total derivative of the second function is $\left[ \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y} \right]$, so the total derivative of their composition is the product

$$\frac{dz}{dt} = \left[ \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y} \right] \left[ \begin{array}{c} \frac{dx}{dt} \\ \frac{dy}{dt} \end{array} \right]$$

and this is precisely the formula you give. The connection to diagrams is that one can represent a composition of linear transformations $\mathbb{R}^1 \to \mathbb{R}^2$ and $\mathbb{R}^2 \to \mathbb{R}^1$ using a pair of incidence matrices, one to represent incidences between a $1$-element set and a $2$-element set, and the other to represent incidences between that $2$-element set and another $1$-element set.
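For a concrete instance of the diagram with edges $t \to x$, $t \to y$, $x \to z$, $y \to z$, here is a small sketch. The specific functions $x = \cos t$, $y = \sin t$, $z = x^2 y$ are assumptions made only for this example; the path-sum value is compared against a finite-difference estimate of $dz/dt$.

```python
import math

def x(t):
    return math.cos(t)       # assumed example: x(t) = cos t

def y(t):
    return math.sin(t)       # assumed example: y(t) = sin t

def z(x_val, y_val):
    return x_val ** 2 * y_val  # assumed example: z(x, y) = x^2 y

def dz_dt_by_paths(t):
    # Each edge of the diagram carries the derivative of its head
    # with respect to its tail, evaluated at the current point.
    edge = {
        ("t", "x"): -math.sin(t),        # dx/dt
        ("t", "y"): math.cos(t),         # dy/dt
        ("x", "z"): 2 * x(t) * y(t),     # partial z / partial x
        ("y", "z"): x(t) ** 2,           # partial z / partial y
    }
    paths = [("t", "x", "z"), ("t", "y", "z")]
    # Multiply along each path, then add over paths.
    return sum(edge[(a, b)] * edge[(b, c)] for a, b, c in paths)

def dz_dt_numeric(t, eps=1e-6):
    # Central difference on the composite t -> z(x(t), y(t)).
    return (z(x(t + eps), y(t + eps)) - z(x(t - eps), y(t - eps))) / (2 * eps)

t0 = 0.4
gap = abs(dz_dt_by_paths(t0) - dz_dt_numeric(t0))
print(dz_dt_by_paths(t0), dz_dt_numeric(t0), gap)
```

The two values should agree up to finite-difference error, since the sum over the two paths is exactly the row-times-column product above.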
