I am not sure if this helps, but I just found that you can actually use the contracted epsilon identity to get a pretty good vector/dyadic representation of the chain rule. The nice thing about this particular identity (at least to me) is that it does what any good chain rule should do, and applies the operator in question to the argument.
Please allow me to preface this by saying that I am not that familiar with the conventions of dyadic notation, but I will present what I have figured out in index notation form, so that if anyone wants to go in and fix my notation, they will know how to. Anyway, here is what I found:
\begin{equation}
\nabla \times \left(\mathbf{A} \circ \mathbf{B}\right) = -\left(\nabla_\mathbf{B}\cdot \mathbf{A}\right)\left(\nabla\times \mathbf{B}\right) + \left(\frac{\partial \mathbf{A}}{\partial \mathbf{B}}\right)^T\left(\nabla \times \mathbf{B}\right) + \left(\frac{\partial \mathbf{A}}{\partial \mathbf{B}}\right)^T \!\!\!\begin{array}{c}
_\cdot \\
^\times\end{array}\!\!\!\left(\nabla \mathbf{B}^T\right)
\end{equation}
The vertical operator notation means that I am crossing the second index of the left tensor with the 1st index of the right tensor, and then contracting the 1st index of the left tensor with the 2nd index of the right tensor.
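To spell that out: in index notation, the way I am reading this dot/cross operator on two second-order tensors $\mathbf{T}$ and $\mathbf{S}$ (consistent with the derivation below) is
\begin{equation}
\left(\mathbf{T}\,\begin{array}{c}
_\cdot \\
^\times\end{array}\,\mathbf{S}\right)_i = \epsilon_{imn}\,T_{lm}\,S_{nl}
\end{equation}
so the last term above has components $\epsilon_{imn}\frac{\partial A_m}{\partial B_l}\frac{\partial B_n}{\partial X_l}$.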
Working it out in index notation, we start with
\begin{equation}
\epsilon_{ijk}\frac{\partial A_j}{\partial B_l}\frac{\partial B_l}{\partial X_k}
\end{equation}
What we want is to have the Levi-Civita symbol apply to $\mathbf{B}$. The first step is to "free" the $l$ and $k$ indices, which we can do using the Kronecker delta:
\begin{equation}
\epsilon_{ijk}\frac{\partial A_j}{\partial B_l}\frac{\partial B_l}{\partial X_k} = \delta_{k_0k_1}\delta_{l_0l_1}\epsilon_{ijk_0}\frac{\partial A_j}{\partial B_{l_0}}\frac{\partial B_{l_1}}{\partial X_{k_1}}
\end{equation}
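For reference, the contracted epsilon identity that we will use (twice) is
\begin{equation}
\epsilon_{rab}\,\epsilon_{rcd} = \delta_{ac}\delta_{bd} - \delta_{ad}\delta_{bc}
\end{equation}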
Now, we can get to the contracted epsilon identity by adding appropriate cancelling terms to the RHS:
\begin{equation}
\left(\delta_{k_0k_1}\delta_{l_0l_1}\epsilon_{ijk_0}\frac{\partial A_j}{\partial B_{l_0}}\frac{\partial B_{l_1}}{\partial X_{k_1}} - \delta_{k_0l_1}\delta_{l_0k_1}\epsilon_{ijk_0}\frac{\partial A_j}{\partial B_{l_0}}\frac{\partial B_{l_1}}{\partial X_{k_1}}\right) + \delta_{k_0l_1}\delta_{l_0k_1}\epsilon_{ijk_0}\frac{\partial A_j}{\partial B_{l_0}}\frac{\partial B_{l_1}}{\partial X_{k_1}}
\end{equation}
We can then use the contracted epsilon identity to rewrite this as
\begin{equation}
\epsilon_{rk_0l_0}\epsilon_{rk_1l_1}\epsilon_{ijk_0}\frac{\partial A_j}{\partial B_{l_0}}\frac{\partial B_{l_1}}{\partial X_{k_1}} + \delta_{k_0l_1}\delta_{l_0k_1}\epsilon_{ijk_0}\frac{\partial A_j}{\partial B_{l_0}}\frac{\partial B_{l_1}}{\partial X_{k_1}}
\end{equation}
Now we are halfway there, since $\epsilon_{rk_1l_1}$ applies to $\frac{\partial B_{l_1}}{\partial X_{k_1}}$.
Notice also that when we juxtapose $\epsilon_{rk_0l_0}$ and $\epsilon_{ijk_0}$, we can apply the contracted epsilon identity once again. Combining these two facts gives us
\begin{equation}
\left(-\delta_{ri}\delta_{l_0j} + \delta_{rj}\delta_{l_0i}\right)\frac{\partial A_j}{\partial B_{l_0}}\epsilon_{rk_1l_1}\frac{\partial B_{l_1}}{\partial X_{k_1}} + \delta_{k_0l_1}\delta_{l_0k_1}\epsilon_{ijk_0}\frac{\partial A_j}{\partial B_{l_0}}\frac{\partial B_{l_1}}{\partial X_{k_1}}
\end{equation}
The expression in vector/dyadic notation above follows directly from this.
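To make that last step explicit, contract the deltas in each term (with free index $i$):
\begin{equation}
-\delta_{ri}\delta_{l_0j}\frac{\partial A_j}{\partial B_{l_0}}\epsilon_{rk_1l_1}\frac{\partial B_{l_1}}{\partial X_{k_1}} = -\frac{\partial A_j}{\partial B_j}\left(\nabla\times\mathbf{B}\right)_i = -\left[\left(\nabla_\mathbf{B}\cdot\mathbf{A}\right)\left(\nabla\times\mathbf{B}\right)\right]_i
\end{equation}
\begin{equation}
\delta_{rj}\delta_{l_0i}\frac{\partial A_j}{\partial B_{l_0}}\epsilon_{rk_1l_1}\frac{\partial B_{l_1}}{\partial X_{k_1}} = \frac{\partial A_j}{\partial B_i}\left(\nabla\times\mathbf{B}\right)_j = \left[\left(\frac{\partial \mathbf{A}}{\partial \mathbf{B}}\right)^T\left(\nabla\times\mathbf{B}\right)\right]_i
\end{equation}
while the remaining term $\delta_{k_0l_1}\delta_{l_0k_1}\epsilon_{ijk_0}\frac{\partial A_j}{\partial B_{l_0}}\frac{\partial B_{l_1}}{\partial X_{k_1}} = \epsilon_{ijk_0}\frac{\partial A_j}{\partial B_{l_0}}\frac{\partial B_{k_0}}{\partial X_{l_0}}$ is exactly the dot/cross term described above.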
For convenience, define the vectors
$$\eqalign{
b &= Ba,\qquad c=-2Cb \\
y &= \tfrac{1}{2}Au,\,\quad v = y+b+c \\
}$$
and the matrices
$$\eqalign{
V &= {\rm Reshape}(v,9,9) \quad\implies\quad v = {\rm vec}(V) \\
Y &= {\rm Reshape}(y,9,9) \quad\implies\quad y = {\rm vec}(Y) \\
}$$
Let's use a colon to denote the trace/Frobenius product, i.e.
$$\eqalign{
&A:B = {\rm Tr}(A^TB) = {\rm Tr}(B^TA)=B:A \\
&A:BC = B^TA:C = AC^T:B = etc \\
}$$
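The rearrangements in the second line are nothing more than the transpose and cyclic invariances of the trace; for example
$$A:BC = {\rm Tr}\big(A^TBC\big) = {\rm Tr}\big((B^TA)^TC\big) = B^TA:C$$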
Then your cost function can be written as
$$\eqalign{
J &= y^Hu + b^Hu + c^Hu \\
&= (y+b+c)^*:u \\
&= v^*:u \\
}$$
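In the second line I am only using the fact that for column vectors the colon reduces to the ordinary dot product:
$$y^Hu = (y^*)^Tu = {\rm Tr}\big((y^*)^Tu\big) = y^*:u$$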
Calculate the differential of $J$
$$\eqalign{
dJ &= v^*:du + u:dv^* \\
&= v^*:du + u:\left(\tfrac{1}{2}A^*du^*\right) \\
&= v^*:du + y:du^* \\
}$$
Next, utilize the relationship with $q$
$$\eqalign{
u &= {\rm vec}(q^*q^T) \\
du &= {\rm vec}(q^*dq^T+dq^*q^T) \\
du^* &= {\rm vec}(q\,dq^H+dq\,q^H) \\
}$$
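Note that $du^*$ is just the elementwise conjugate of $du$, since
$$\big(q^*dq^T+dq^*q^T\big)^* = q\,dq^H+dq\,q^H$$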
Substitute this into the differential expression, then calculate the gradient
$$\eqalign{
dJ &= v^*:du + y:du^* \\
&= {\rm vec}(V^*):{\rm vec}(q^*dq^T+dq^*q^T)
+ {\rm vec}(Y):{\rm vec}(q\,dq^H+dq\,q^H) \\
&= V^*:\left(q^*dq^T+dq^*q^T\right) + Y:\left(q\,dq^H+dq\,q^H\right) \\
&= \left(V^H:dq\,q^H + V^*:dq^*q^T\right)
+ \left(Y^T:dq^*q^T + Y:dq\,q^H\right) \\
&= \left(V^Hq^*:dq + V^*q:dq^*\right) + \left(Y^Tq:dq^* + Yq^*:dq\right)\\
&= \left(V^Hq^*+Yq^*\right):dq + \left(V^*q+Y^Tq\right):dq^* \\
\frac{\partial J}{\partial q} &= V^Hq^* + Yq^*, \qquad
\frac{\partial J}{\partial q^*}= V^*q + Y^Tq \\
}$$
Since $J$ is real, these results imply that $(V,Y)$ are Hermitian.
Some comments asked about how the trace/Frobenius product
behaves with vectors.
$$\eqalign{
{\rm Tr}(y^TAx) &= {\rm Tr}((Ax)^Ty) = {\rm Tr}((yx^T)^TA) = {\rm Tr}(A(yx^T)^T) = etc \\
y:Ax &= Ax:y = yx^T:A = A:yx^T = etc \\
}$$
Update
The Frobenius Inner Product described on Wikipedia is defined as
$$\langle A,B\rangle = {\rm Tr}(A^HB)$$
and is different from the product used in this answer, which is
$$A:B={\rm Tr}(A^TB)$$
The colon product has the following extremely useful properties
$$\eqalign{
A:B &= B:A \qquad&\big({\rm commutes\,with\,itself}\big) \\
A:B\odot C &= A\odot B:C \qquad&\big({\rm commutes\,with\,Hadamard}\big) \\
}$$
which hold even when the matrices involved are complex.
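Both properties are immediate from the componentwise form of the colon product, e.g.
$$A:B\odot C = \sum_{i,j}A_{ij}B_{ij}C_{ij} = A\odot B:C$$
and the same expansion shows $A:B=B:A$.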
The Wikipedia product satisfies neither property.
I find the fact that $\langle A,B\rangle \ne \langle B,A\rangle$, and trying to remember which matrix gets conjugated, to be a perpetual source of errors.
The only nice property of the Wikipedia product is
$$\|A\|_F^2 = \langle A,A\rangle$$
but this is only fractionally more succinct than
$$\|A\|_F^2 = A^*:A$$
Best Answer
You’re getting tripped up by a misconception that Git Gud pointed out in a comment: that the derivative of a multivariable function is always a vector. Although this is true when the function’s codomain is one-dimensional, which might be all that you’ve been exposed to until now, it’s not generally true for vector-valued functions.
Assuming the “bulk” form of the chain rule that you’ve cited, we have, as you say, $\nabla\phi = \nabla f\nabla g$. Looking at this in purely algebraic terms, $\nabla\phi$ is a $1\times2$ matrix (a vector) as is $\nabla f$, so there are only two possibilities for $\nabla g$: it’s either a scalar or a $2\times2$ matrix. Multiplication by a scalar is the same as multiplication by a multiple of the identity matrix, so both of these cases can be combined into one.
What does this $2\times2$ matrix look like? Expanding the components of $\nabla\phi$ we have $$\begin{align}\nabla\phi = \begin{pmatrix}{\partial\phi\over\partial x} & {\partial\phi\over\partial y} \end{pmatrix} &= \begin{pmatrix} {\partial f\over\partial u}{\partial u\over\partial x}+{\partial f\over\partial v}{\partial v\over\partial x} & {\partial f\over\partial u}{\partial u\over\partial y}+{\partial f\over\partial v}{\partial v\over\partial y} \end{pmatrix} \\ &= \begin{pmatrix}{\partial f\over\partial u} & {\partial f\over\partial v} \end{pmatrix} \begin{pmatrix} {\partial u\over\partial x} & {\partial u\over\partial y} \\ {\partial v\over\partial x} & {\partial v\over\partial y} \end{pmatrix}.\end{align}\tag{*}$$ Comparing this to the product $\nabla\phi = \nabla f\nabla g$ we can see that $\nabla g$ must be the $2\times2$ matrix of partial derivatives in the last line above. This matrix of partial derivatives is known as the Jacobian matrix of $g$ and one can think of the gradient of a scalar-valued function as a special case of this.
The first row of the above matrix is $\nabla u$ and the second row $\nabla v$, so we can write $$\nabla g = \begin{pmatrix}\nabla u\\\nabla v\end{pmatrix}.$$ This point of view in which the rows or columns of a matrix are treated as individual row/column vectors is quite common, for instance when we talk about the row and column spaces of a matrix or describe the columns of a transformation matrix as the images of the basis vectors. The product of a row vector and matrix, such as we have in (*) can be viewed as a linear combination of the rows of the matrix with coefficients given by the vector, i.e., $$\nabla\phi = \nabla f\nabla g = {\partial f\over\partial u}\nabla u+{\partial f\over\partial v}\nabla v.$$
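As a quick sanity check (a made-up example, not from your question): take $g(x,y)=(u,v)=(xy,\;x+y)$ and $f(u,v)=uv$, so that $\phi(x,y)=x^2y+xy^2$. Then $$\nabla f\,\nabla g=\begin{pmatrix}v & u\end{pmatrix}\begin{pmatrix}y & x\\ 1 & 1\end{pmatrix}=\begin{pmatrix}2xy+y^2 & x^2+2xy\end{pmatrix},$$ which is exactly what you get by differentiating $\phi$ directly.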
Another way to see why $\nabla g$ is a matrix is to go back to the definition of the differential of a function as the linear map that best approximates the change in the function’s value near a point: $$f(\mathbf v+\mathbf h)-f(\mathbf v) = \operatorname{d}f_{\mathbf v}[\mathbf h]+o(\mathbf h).$$ From this definition, we see that if, say, $f:\mathbb R^n\to\mathbb R^m$, then $\operatorname{d}f_{\mathbf v}:\mathbb R^n\to\mathbb R^m$, too. (Technically, the domain of $\operatorname{d}f_{\mathbf v}$ is the tangent space at $\mathbf v$, but we can gloss over that when working in $\mathbb R^n$.) This means that $\operatorname{d}f_{\mathbf v}$, when expressed in coordinates, is an $m\times n$ matrix, the Jacobian matrix of $f$, in fact. Now, strictly speaking $\operatorname{d}f_{\mathbf v}$ and the derivative of $f$ aren’t quite the same thing—for $f:\mathbb R^n\to\mathbb R$, for instance, $\operatorname{d}f_{\mathbf v}$ is a linear functional that lives in the dual space $(\mathbb R^n)^*$ while $\nabla f(\mathbf v)$ is a vector in $\mathbb R^n$—but with a suitable choice of coordinate systems we can blithely ignore this distinction, identify them and say that $\nabla g$ is also a $2\times2$ matrix.