Don't worry: multidimensional analysis is a shock to the system the first time you see it.
1) The norm is the distance to the origin $d(x,0)$, assumed to be Euclidean, so $$\lVert x \rVert = \sqrt{\sum x_i^2}$$
2) Okay, let's get a grip on this. $$f:\mathbb{R}^n \to \mathbb{R}^m$$ so we have component functions $f_1(x_1,\ldots,x_n), \ldots, f_m(x_1,\ldots,x_n)$. These are really $m$ separate functions $f_i:\mathbb{R}^n \to \mathbb{R}$.
Great; now, what is a derivative for a function like $f_i$? It's the gradient $\nabla f_i$. Hence all of the derivative information is encoded in an $m\times n$ matrix $A=J_f(u)$ with components $$A_{ij} = \partial_j f_i$$
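As a concrete sketch, here is a forward-difference approximation of that matrix in NumPy (the helper name `jacobian` and the example `f` are my own, purely illustrative):

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Forward-difference approximation of the m x n Jacobian A_ij = d f_i / d x_j."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    A = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        A[:, j] = (np.asarray(f(x + dx)) - fx) / eps  # column j: partials w.r.t. x_j
    return A

# f : R^2 -> R^3, so A is 3 x 2; row i is the gradient of f_i
f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
print(jacobian(f, [1.0, 2.0]))
```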
Right! But now we want a formal definition of differentiability. How do we do it for a single $f_i=g$? We say $\nabla g$ is the derivative if it tells us what the *directional* derivatives are. How does it do this?
We want $g(x+dx) \approx g(x) + dx \cdot \nabla g$: small changes get dotted with the derivative. So formally, we want a vector $v$ such that $$\lim_{h\to 0} \frac{|g(x+h) - g(x) - h \cdot v |}{\lVert h \rVert} = 0,$$ so that the error vanishes faster than $\lVert h \rVert$ itself.
But now each $f_i$ gives us a different vector $v_i$. Stacking these as the rows of $A$, and replacing the absolute value in the numerator by a norm that bounds the error of every $f_i$ at once, gives the definition you have: $$\lim_{h\to 0} \frac{\lVert f(x+h) - f(x) - Ah \rVert}{\lVert h \rVert} = 0.$$
Edit: To summarize, $h$ is a small change in the coordinates, $Ah$ collects the directional derivatives of all the separate $f_i$'s in this direction, and $A$ is the 'gradient' containing all the derivative information.
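To see the limit definition in action numerically, here is a small NumPy sketch (the particular $g$ and the step sizes are arbitrary choices): the ratio $|g(x+h)-g(x)-h\cdot\nabla g|/\lVert h\rVert$ should tend to $0$ as $h \to 0$.

```python
import numpy as np

g = lambda x: x[0] ** 2 + np.sin(x[1])                 # scalar g : R^2 -> R
grad_g = lambda x: np.array([2 * x[0], np.cos(x[1])])  # its gradient

x = np.array([1.0, 0.5])
direction = np.array([0.6, 0.8])                       # fixed unit direction
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * direction
    err = abs(g(x + h) - g(x) - h @ grad_g(x))
    print(t, err / np.linalg.norm(h))                  # ratio -> 0 as h -> 0
```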
There is nothing special about $F$; it is a general fact that computing the gradient of any scalar function in reverse mode costs a small constant multiple of evaluating the function itself. You are probably just being thrown off by the reverse indexing, the additional derivative terms present, and the non-standard notation.
To simplify notation, let $A_i = \big(\frac{\partial h_i}{\partial\theta}\big)^T$, $B_i = \big(\frac{\partial h_{i}}{\partial x_{i}}\big)^T$, $v_i =\big(\frac{\partial p_{i}}{\partial x_{i}}\big)^T$. So we want
$$
\frac{\partial}{\partial \phi} \sum_{i=n}^0 F_i, \qquad F_i = f^TA_i \lambda_i , \qquad \lambda_{i-1} = B_i \lambda_i + v_i \beta_i
$$
Then
$$\begin{aligned}
\frac{\partial}{\partial \phi} F_i &= \frac{\partial f^TA_i \lambda_i}{\partial \phi}
\\&= \frac{\partial f^TA_i \lambda_i}{\partial \lambda_i}\frac{\partial \lambda_i}{\partial \phi}
\\&= \frac{\partial f^TA_i \lambda_i}{\partial \lambda_i}\frac{\partial (B_{i+1} \lambda_{i+1} + v_{i+1} \beta_{i+1})}{\partial \phi}
\\&= \frac{\partial f^TA_i \lambda_i}{\partial \lambda_i}\Big(\frac{\partial B_{i+1}\lambda_{i+1}}{\partial \lambda_{i+1}}\frac{\partial\lambda_{i+1}}{\partial \phi}+ \frac{\partial v_{i+1}\beta_{i+1}}{\partial \beta_{i+1}}\frac{\partial \beta_{i+1}}{\partial \phi}\Big)
\\&= \frac{\partial f^TA_i \lambda_i}{\partial \lambda_i}\Big(\frac{\partial B_{i+1}\lambda_{i+1}}{\partial \lambda_{i+1}}\Big(\frac{\partial B_{i+2}\lambda_{i+2}}{\partial \lambda_{i+2}}\frac{\partial\lambda_{i+2}}{\partial \phi}+ \frac{\partial v_{i+2}\beta_{i+2}}{\partial \beta_{i+2}}\frac{\partial \beta_{i+2}}{\partial \phi}\Big)+ \frac{\partial v_{i+1}\beta_{i+1}}{\partial \beta_{i+1}}\frac{\partial \beta_{i+1}}{\partial \phi}\Big)
\\&=\ldots
\end{aligned}$$
And then we just accumulate gradients from the left: $z^T_0 = \frac{\partial f^TA_i \lambda_i}{\partial \lambda_i}$ is a row vector. Then $z_1^T = z_0^T\frac{\partial B_{i+1}\lambda_{i+1}}{\partial \lambda_{i+1}}$ is a VJP, $z_2^T = z_1^T\frac{\partial B_{i+2}\lambda_{i+2}}{\partial \lambda_{i+2}}$ is a VJP, and so on. Each step multiplies a row vector into a Jacobian, so we only ever form vector-Jacobian products, never matrix-matrix products.
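A minimal sketch of that left-to-right accumulation in NumPy (here each Jacobian $\frac{\partial B_k\lambda_k}{\partial \lambda_k} = B_k$ is an explicit random matrix purely for illustration; a real reverse-mode implementation would expose each factor as a VJP callback instead of a stored matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4
Bs = [rng.normal(size=(d, d)) for _ in range(n)]  # stand-ins for the Jacobians B_k
z = rng.normal(size=d)                            # stand-in for the row vector z_0^T

# Accumulating from the left keeps every intermediate a length-d row vector:
# each step costs one vector-matrix product, never a matrix-matrix product.
for B in Bs:
    z = z @ B                                     # z_{k+1}^T = z_k^T B_k
print(z)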
Best Answer
Let $p=(x,y,z)$ and $f(p)=\dfrac p{\lVert p\rVert}$. Then
$$\frac d{dx}\frac x{\sqrt{x^2+y^2+z^2}}=\frac {y^2+z^2}{(x^2+y^2+z^2)^{\frac32}}$$
and
$$\frac d{dx}\frac 1{\sqrt{x^2+y^2+z^2}}=-\frac x{(x^2+y^2+z^2)^{\frac32}}$$
Then
$$Df(p)=\frac 1{(x^2+y^2+z^2)^{\frac32}}\begin{pmatrix}y^2+z^2&-xy&-xz\\-xy&x^2+z^2&-yz\\-xz&-yz&x^2+y^2\end{pmatrix}$$
Let the matrix part be $M$. Then the elements of $M^2$ are
$$(M^2)_{11}=(y^2+z^2)^2+x^2y^2+x^2z^2=(y^2+z^2)(x^2+y^2+z^2)$$
$$(M^2)_{12}=(y^2+z^2)(-xy)-xy(x^2+z^2)+xyz^2=-xy(x^2+y^2+z^2)$$
I think you can continue from here.
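A quick NumPy check of both the Jacobian formula and the pattern suggested by those two entries, namely $M^2=(x^2+y^2+z^2)M$ (the test point `p` is an arbitrary choice of mine):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
x, y, z = p
r2 = x**2 + y**2 + z**2

M = np.array([[y**2 + z**2, -x * y,       -x * z],
              [-x * y,       x**2 + z**2, -y * z],
              [-x * z,      -y * z,        x**2 + y**2]])
Df = M / r2**1.5

# Finite-difference check of Df against f(p) = p / ||p||
f = lambda q: q / np.linalg.norm(q)
eps = 1e-6
Df_num = np.column_stack([(f(p + eps * e) - f(p)) / eps for e in np.eye(3)])
print(np.allclose(Df, Df_num, atol=1e-5))   # True

# The pattern from the two computed entries: M^2 = (x^2 + y^2 + z^2) M
print(np.allclose(M @ M, r2 * M))           # True
```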