Calculus – Gradient and Hessian Using Chain/Product Rule

calculus, derivatives, hessian-matrix, matrix-calculus, multivariable-calculus

I am reading a paper (How Hessian structure explains mysteries in sharpness regularization), and I am not sure how Eq (2) on page 2 is derived from Eq (1) using the standard product or chain rule.

In general, my approach (picked up from other posts on MathSE) to using the chain/product rule for multivariate derivatives (vector or matrix) has been to compute the differential [*] $D(f\circ g) = Df(g)\, Dg$ or $D(fg) = (Df)\cdot g + f\cdot Dg$, and then read off the gradient (in the usual column-vector convention) with transpose operations where needed.
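For example, applying this to $\mathcal{L}(\theta) = \mathcal{L}(z(\theta))$ gives $$D_\theta\mathcal{L} = D_z\mathcal{L}\; D_\theta z = (\nabla_z\mathcal{L})^\top J \quad\Longrightarrow\quad \nabla_\theta\mathcal{L} = J^\top\nabla_z\mathcal{L},$$ where $J = D_\theta z$ is the Jacobian of $z$ with respect to $\theta$, which is how I recover Eq (1).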

However, it's not clear how to use that approach for second-order derivatives, so my questions are:

  1. Taking the Hessian $\nabla^2_{\theta}\mathcal{L} = D[\nabla_{\theta}\mathcal{L}]$, how does one arrive at (2) from (1)? Using the product rule, (1) becomes \begin{align*}D[\nabla_{\theta}\mathcal{L}] = D_{\theta}[J^\top\nabla_{z}\mathcal{L}]&=D_{\theta}[J]^\top\nabla_{z}\mathcal{L} + J^\top D_{\theta}[\nabla_{z}\mathcal{L}].\end{align*} How does one compute (a) the first term (I am unsure what the differential of a Jacobian is), and (b) the second term (I am unsure what the differential of a gradient is when the differentiation variable is different)?

  2. Is my initial approach correct? It feels a bit "hacky" at times, but it mostly seems to work (e.g. I was able to compute Eq (13a,b,c) on page 20 of this paper); I would really appreciate a reference that discusses this rigorously (rather than just listing formulae for use), so I can get more familiar with such computations.

Thanks a lot!

[*]: The differential of a function $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ is defined by $f(x + \Delta x) = f(x) + Df(x)\, \Delta x + o(\|\Delta x\|)$, where the multiplication is the usual matrix-vector product.

Best Answer

$ \def\l{\lambda} \def\n{\nabla} \def\k{\otimes} \def\h{\odot} \def\O{{\large\Omega}} \def\o{{\tt1}} \def\L{{\cal L}} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\hess#1#2#3{\grad{^2 #1}{#2\,\p #3}} \def\Hess#1#2{\grad{^2 #1}{#2^2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} \def\hessLR#1#2#3{\LR{\hess{#1}{#2}{#3}}} \def\HessLR#1#2{\LR{\Hess{#1}{#2}}} $Let's use Leibniz notation instead of $\n$ notation and a variable naming convention wherein uppercase $({\rm e.g.}\;J)\,$ denotes a matrix, lowercase $(z)\,$ a vector, and calligraphy $(\L\,)$ a scalar. Towards that end rename $\theta\to p\,$ for the vector of parameters.

Finally, let's use the Frobenius product $(:)$ to avoid transposition errors $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ B:B &= \frob{B}^2 \qquad \{ {\rm Frobenius\;norm} \}\\ \LR{XY}:B &= X:\LR{BY^T} \;=\; Y:\LR{X^TB} \\ }$$

In this problem, the variables have the following functional dependence $$\eqalign{ \L &= \L(y,z) &\qiq&w=\grad{\L}{z},\quad&H=\grad{w}{z}=\Hess{\L}{z} \\ z &= z(x,p) &\qiq&J=\grad{z}{p},\quad&\O=\grad{J}{p}=\Hess{z}{p} \\ }$$

Now let's calculate the differential and gradient of $\L$ wrt $p$ $$\eqalign{ d\L &= w:\c{dz} \\&= w:\CLR{J\:dp} \\&= \LR{J^Tw}:dp \\ \grad{\L}{p} &= {J^Tw} \qquad\qquad\qquad\;\; \{{\tt 1}\} \\ }$$

Then the differential and gradient of $\LR{g\doteq\large\grad{\L}{p}}$ wrt $p$ $$\eqalign{ dg &= J^T\c{dw} \;+\; \c{\LR{dJ}^T}w \\ &= J^T\CLR{H\,dz} \;+\; \c{\LR{\O\cdot dp}^T}w \\ &= J^THJ\,dp \;+\; \LR{w\cdot\O}dp \\ \grad{g}{p} &= {J^THJ \;+\; {w\cdot\O}} \\ \Hess{\L}{p} &= {J^THJ \;+\; {w\cdot\O}} \qquad \{{\tt 2}\} \\ }$$

The tricky part is that $\O$ is a third-order tensor, which does not fit into standard matrix notation, so one must resort to explicit dot products and/or index notation, e.g. $$\eqalign{ da &= \LR{\O\cdot dp}^Tw \\ da_j &= \LR{\O_{ijk}\,dp_k}w_i = \LR{w_i\,\O_{ijk}} dp_k \\ da &= \LR{w\cdot\O}\cdot dp \\ }$$
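If you want a numerical sanity check of $\{{\tt 2}\}$, here is a minimal sketch using JAX automatic differentiation. The toy functions z and loss below are arbitrary stand-ins chosen only for this check (they are not from the paper), and the einsum contraction implements the $w\cdot\O$ term in index form, exactly as in the last display above.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)   # double precision for a tight comparison

# Toy stand-ins (made up for this check): z : R^3 -> R^2 and a scalar loss L(z).
def z(p):
    return jnp.stack([jnp.sin(p[0]) * p[1], p[2] ** 2 + p[0] * p[1]])

def loss(zv):
    return jnp.sum(zv ** 4) + zv[0] * zv[1]

p = jnp.array([0.3, -1.2, 0.7])

J     = jax.jacobian(z)(p)                  # J[i,j] = dz_i/dp_j,                shape (2, 3)
w     = jax.grad(loss)(z(p))                # w[i] = dL/dz_i,                    shape (2,)
H     = jax.hessian(loss)(z(p))             # H[i,l] = d^2 L/(dz_i dz_l),        shape (2, 2)
Omega = jax.jacobian(jax.jacobian(z))(p)    # Omega[i,j,k] = d^2 z_i/(dp_j dp_k), shape (2, 3, 3)

rhs = J.T @ H @ J + jnp.einsum("i,ijk->jk", w, Omega)    # J^T H J  +  w.Omega
lhs = jax.hessian(lambda q: loss(z(q)))(p)                # Hessian of L(z(p)) wrt p

print(jnp.allclose(lhs, rhs))               # expected: True
```

The contraction `einsum("i,ijk->jk", w, Omega)` is $\sum_i w_i\,\O_{ijk}$, i.e. the $w\cdot\O$ matrix, so the printed check compares the two sides of $\{{\tt 2}\}$ directly.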
