Calculating the gradient of a scalar with vector-Jacobian products

derivatives, jacobian, matrix-calculus

I am trying to calculate the gradient with respect to $\phi$ of the scalar
\begin{align*}
F = \sum_{i=0}^nf^T\dfrac{\partial h_i}{\partial \theta}^T\lambda_i \tag{1}
\end{align*}

$f, \theta \in \mathcal{R}^{m}$ and $h_i \in \mathcal{R}^k$, where $h_i$ is shorthand for $h_i(x_i, \theta)$ with $x_i \in \mathcal{R}^k$. The $\lambda_i \in \mathcal{R}^k$ are defined by
\begin{align*}
&\lambda_{n} = 0 \\
&\lambda_{i-1} = \dfrac{\partial h_i}{\partial x_i}^T\lambda_i + \dfrac{\partial p_i}{\partial x_i}^T\beta_i(\phi)
\end{align*}

where $\beta_i = \beta_i(\phi)$ and $p_i = p_i(x_i)$ are scalars.

Supposedly, using vector-Jacobian products (i.e. reverse-mode differentiation) the gradient $\nabla_{\phi}F$ can be calculated at a cost that is a constant multiple of the cost of evaluating $F$. However, after much trying I am unable to express the gradient using only vector-Jacobian products. I can only get it down to either matrix-Jacobian products or Jacobian-vector products. Either way, the cost of the gradient would then scale with the dimension $k$ of $x$ rather than being a constant multiple. This leaves me confused as to whether I am doing something wrong, or whether there is some caveat to reverse-mode differentiation that I'm not aware of. I strongly suspect the former, but I'm not sure what to do differently.
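For concreteness, here is a minimal JAX sketch of evaluating $F$ itself. The particular $h$, $p$, $\beta$, the weights and all shapes are made-up placeholders (none of them are specified above); the point is only that the recursion has the structure described, and that every transposed Jacobian in it can be applied with `jax.vjp`.

```python
import jax
import jax.numpy as jnp

# Placeholder h_i, p_i, beta_i and shapes -- purely illustrative.
k, m = 4, 3                                    # dim of x_i, h_i and of theta, f
k_x, k_t = jax.random.split(jax.random.PRNGKey(0))
W_x = jax.random.normal(k_x, (k, k))
W_t = jax.random.normal(k_t, (k, m))

def h(x, theta):                               # h_i : R^k x R^m -> R^k
    return jnp.tanh(W_x @ x + W_t @ theta)

def p(x):                                      # scalar p_i(x_i)
    return jnp.sum(jnp.sin(x))

def beta(i, phi):                              # scalar beta_i(phi)
    return jnp.cos(i * phi[0]) * phi[1]

def F(phi, theta, f, xs):
    """F = sum_i f^T (dh_i/dtheta)^T lambda_i, with lambda_n = 0 and
    lambda_{i-1} = (dh_i/dx_i)^T lambda_i + (dp_i/dx_i)^T beta_i(phi)."""
    n = xs.shape[0] - 1
    lam = jnp.zeros(k)                         # lambda_n = 0
    total = 0.0
    for i in range(n, -1, -1):
        # f^T (dh_i/dtheta)^T lambda_i : a VJP in theta, then a dot with f
        _, vjp_theta = jax.vjp(lambda th: h(xs[i], th), theta)
        total = total + jnp.dot(vjp_theta(lam)[0], f)
        # lambda_{i-1} = (dh_i/dx_i)^T lambda_i + (dp_i/dx_i)^T beta_i : VJPs again
        _, vjp_x = jax.vjp(lambda x: h(x, theta), xs[i])
        lam = vjp_x(lam)[0] + jax.grad(p)(xs[i]) * beta(i, phi)
    return total
```

Evaluating $F$ this way already costs one VJP in $\theta$ and one VJP in $x_i$ per step.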

Gradient calculation

Method A

Using
\begin{align*}
\dfrac{\partial \lambda_{i-1}}{\partial \phi} = \dfrac{\partial h_i}{\partial x_i}^T\dfrac{\partial \lambda_i}{\partial \phi} + \dfrac{\partial p_i}{\partial x_i}^T\dfrac{\partial \beta_i}{\partial \phi}
\end{align*}

so that the gradient is
\begin{align*}
\sum_{i=0}^nf^T\dfrac{\partial h_i}{\partial \theta}^T\dfrac{\partial \lambda_i }{\partial \phi} .
\end{align*}

You can massage this further, but you always end up having to form some sort of matrix in $\mathcal{R}^{k\times k}$.
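To make the scaling concrete, here is a sketch of Method A, reusing the placeholder $h$, $p$, $\beta$ and shapes from the snippet above, with $\phi \in \mathcal{R}^d$. Carrying $\partial\lambda_i/\partial\phi$ through the recursion means updating a $k\times d$ matrix with the transposed Jacobian at every step, i.e. a matrix-Jacobian product; below it is done as $d$ separate VJPs via `jax.vmap` (forming the full $k\times k$ Jacobian with `jax.jacrev` is the other option).

```python
def grad_F_method_A(phi, theta, f, xs):
    """Method A: propagate dlambda_i/dphi (a k x d matrix) through the recursion.
    Each step applies (dh_i/dx_i)^T to that matrix -- here as d VJPs via vmap."""
    n, d = xs.shape[0] - 1, phi.shape[0]
    dlam = jnp.zeros((k, d))                   # dlambda_n/dphi = 0
    g = jnp.zeros(d)
    for i in range(n, -1, -1):
        # f^T (dh_i/dtheta)^T dlambda_i/dphi : one JVP in direction f, then a matvec
        _, w = jax.jvp(lambda th: h(xs[i], th), (theta,), (f,))
        g = g + w @ dlam
        # dlambda_{i-1}/dphi = (dh_i/dx_i)^T dlambda_i/dphi + (dp_i/dx_i)^T dbeta_i/dphi
        _, vjp_x = jax.vjp(lambda x: h(x, theta), xs[i])
        dlam = jax.vmap(lambda col: vjp_x(col)[0], in_axes=1, out_axes=1)(dlam) \
               + jnp.outer(jax.grad(p)(xs[i]), jax.grad(beta, argnums=1)(i, phi))
    return g
```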

Method B

The gradient is
\begin{align*}
\sum_{i=0}^{n-1}\xi_i^T\dfrac{\partial p_{i+1}}{\partial x_{i+1}}^T\dfrac{\partial \beta_{i+1}}{\partial \phi}
\end{align*}

where $\xi$ are given by
\begin{align*}
&\xi_0^T = f^T\dfrac{\partial h_0}{\partial \theta}^T \\
& \xi_{i+1}^T = f^T\dfrac{\partial h_{i+1}}{\partial \theta}^T + \xi_i^T\dfrac{\partial h_{i+1}}{\partial x_{i+1}}^T
\end{align*}

In this case you still have to deal with $\partial h_{i+1}/\partial x_{i+1}$, and $\xi_i^T\Big(\dfrac{\partial h_{i+1}}{\partial x_{i+1}}\Big)^T = \Big(\dfrac{\partial h_{i+1}}{\partial x_{i+1}}\xi_i\Big)^T$ is a Jacobian-vector product, not a vector-Jacobian product.
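A sketch of Method B with the same placeholders as before, where every Jacobian application maps onto `jax.jvp`:

```python
def grad_F_method_B(phi, theta, f, xs):
    """Method B: the xi recursion, where every Jacobian application is a
    Jacobian-vector product (forward mode) rather than a VJP."""
    n = xs.shape[0] - 1
    dbeta = lambda i: jax.grad(beta, argnums=1)(i, phi)        # dbeta_i/dphi
    # xi_0 = (dh_0/dtheta) f : a JVP in theta-direction f
    _, xi = jax.jvp(lambda th: h(xs[0], th), (theta,), (f,))
    g = jnp.zeros_like(phi)
    for i in range(n):
        # term i: (xi_i . dp_{i+1}/dx_{i+1}) * dbeta_{i+1}/dphi
        g = g + jnp.dot(xi, jax.grad(p)(xs[i + 1])) * dbeta(i + 1)
        # xi_{i+1} = (dh_{i+1}/dtheta) f + (dh_{i+1}/dx_{i+1}) xi_i : two JVPs
        _, jvp_t = jax.jvp(lambda th: h(xs[i + 1], th), (theta,), (f,))
        _, jvp_x = jax.jvp(lambda x: h(x, theta), (xs[i + 1],), (xi,))
        xi = jvp_t + jvp_x
    return g
```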

Best Answer

There is nothing special about $F$: it is a fact that computing the gradient of any scalar function in reverse mode costs only a small constant multiple of evaluating that function. You are probably just getting tripped up by the reverse indexing, the additional derivatives present, and the non-standard notation.

To simplify notation, let $A_i = \big(\frac{\partial h_i}{\partial\theta}\big)^T$, $B_i = \big(\frac{\partial h_{i}}{\partial x_{i}}\big)^T$, $v_i =\big(\frac{\partial p_{i}}{\partial x_{i}}\big)^T$. So we want

$$ \frac{\partial}{\partial \phi} \sum_{i=0}^{n} F_i, \qquad F_i = f^TA_i \lambda_i , \qquad \lambda_{i-1} = B_i \lambda_i + v_i \beta_i $$

Then

$$\begin{aligned} \frac{\partial}{\partial \phi} F_i &= \frac{\partial f^TA_i \lambda_i}{\partial \phi} \\&= \frac{\partial f^TA_i \lambda_i}{\partial \lambda_i}\frac{\partial \lambda_i}{\partial \phi} \\&= \frac{\partial f^TA_i \lambda_i}{\partial \lambda_i}\frac{\partial (B_{i+1} \lambda_{i+1} + v_{i+1} \beta_{i+1})}{\partial \phi} \\&= \frac{\partial f^TA_i \lambda_i}{\partial \lambda_i}\Big(\frac{\partial B_{i+1}\lambda_{i+1}}{\partial \lambda_{i+1}}\frac{\partial\lambda_{i+1}}{\partial \phi}+ \frac{\partial v_{i+1}\beta_{i+1}}{\partial \beta_{i+1}}\frac{\partial \beta_{i+1}}{\partial \phi}\Big) \\&= \frac{\partial f^TA_i \lambda_i}{\partial \lambda_i}\Big(\frac{\partial B_{i+1}\lambda_{i+1}}{\partial \lambda_{i+1}}\Big(\frac{\partial B_{i+2}\lambda_{i+2}}{\partial \lambda_{i+2}}\frac{\partial\lambda_{i+2}}{\partial \phi}+ \frac{\partial v_{i+2}\beta_{i+2}}{\partial \beta_{i+2}}\frac{\partial \beta_{i+2}}{\partial \phi}\Big)+ \frac{\partial v_{i+1}\beta_{i+1}}{\partial \beta_{i+1}}\frac{\partial \beta_{i+1}}{\partial \phi}\Big) \\&= \cdots \end{aligned}$$

And then we just accumulate gradients from the left: $z^T_0 = \frac{\partial f^TA_i \lambda_i}{\partial \lambda_i}$ is a row vector. Then $z_1^T = z_0^T\frac{\partial B_{i+1}\lambda_{i+1}}{\partial \lambda_{i+1}}$ is a VJP, $z_2^T = z_1^T\frac{\partial B_{i+2}\lambda_{i+2}}{\partial \lambda_{i+2}}$ is another VJP, and so on.
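In autodiff terms this left-to-right accumulation is exactly what a single reverse-mode sweep does for you. Continuing with the placeholder implementation of $F$ from the first sketch in the question, the whole gradient is one call, and its cost is a small constant multiple of the cost of evaluating $F$, independent of $k$ and of the dimension of $\phi$:

```python
# Reverse-mode gradient of the scalar F w.r.t. phi: jax.grad records the forward
# evaluation and then pulls the cotangent 1.0 back through the chain of VJPs,
# i.e. the left-to-right accumulation written out above.
grad_F = jax.grad(F, argnums=0)

k_t, k_f, k_x = jax.random.split(jax.random.PRNGKey(1), 3)
phi   = jnp.array([0.3, -1.2])
theta = jax.random.normal(k_t, (m,))
f     = jax.random.normal(k_f, (m,))
xs    = jax.random.normal(k_x, (11, k))        # x_0 ... x_n with n = 10

print(F(phi, theta, f, xs))
print(grad_F(phi, theta, f, xs))               # same shape as phi
```

Equivalently, `_, pullback = jax.vjp(lambda ph: F(ph, theta, f, xs), phi)` followed by `pullback(1.0)` gives the same vector; `jax.grad` is just that VJP seeded with the cotangent $1.0$.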
