$
\def\b{\bullet}
\def\e{\varepsilon}
\def\m#1{\left[\begin{array}{c}#1\end{array}\right]}
\def\p#1#2{\frac{\partial #1}{\partial #2}}
$I
really like The Matrix Cookbook but the section on structured matrices is not very good, so here's a different approach to the subject.
Given a vector of parameters $\{p\}$ and matrix basis $\{B_i\}$
$$\eqalign{
p &= \m{\alpha \\ \beta},\qquad
B_1 = \m{1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0},\qquad
B_2 = \m{0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0}
\\
}$$
create a structured matrix $\{A\}$ and cost function $\{\phi\}$
$$\eqalign{
A &= \sum_{i=1}^2\;p_iB_i
\;=\; \m{\alpha & \alpha & 0 & 0 \\ 0 & \beta & 0 & 0 \\ 0 & \beta & \beta & 0},\qquad
&\phi = \tfrac 12\Big\|AX-Y\Big\|_F^2 \\
}$$
Note that $(\alpha,\beta)$ are the only independent variables in the entire problem.
When $A$ is unconstrained it's easy to calculate the gradient/differential of the cost
$$\eqalign{
G = \p{\phi}{A} = (AX-Y)X^T \quad\implies\quad d\phi = G\b dA \\
}$$
where the bullet denotes the matrix inner product, i.e.
$$\eqalign{
G\b dA
&= \sum_{i=1}^3\sum_{j=1}^4 G_{ij}\;dA_{ij} \;=\; {\rm Tr}(G^TdA) \\
}$$
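As a quick numerical sanity check (my own sketch, not part of the original post), the bullet product agrees with the trace formula for arbitrary same-shaped matrices:

```python
import numpy as np

# Random 3x4 matrices standing in for G and dA (arbitrary test data)
rng = np.random.default_rng(0)
G = rng.standard_normal((3, 4))
dA = rng.standard_normal((3, 4))

# bullet product: elementwise multiply, then sum over all entries
bullet = np.sum(G * dA)
# equivalent trace form: Tr(G^T dA)
trace_form = np.trace(G.T @ dA)

assert np.isclose(bullet, trace_form)
```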
Because of the structure which was imposed on $A$,
its differential is also structured
$$dA = \sum_{i=1}^2 B_i\,dp_i$$
Substituting this expression leads to the parametric gradient
$$\eqalign{
d\phi &= \sum_{i=1}^2\;G\b(B_i\,dp_i)
= \sum_{i=1}^2\left(\p{\phi}{p_i}\right)dp_i \quad\implies\quad
\p{\phi}{p_i} = G\b B_i \\
}$$
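The parametric gradient formula is easy to verify numerically. The sketch below uses the $B_1,B_2$ basis from above with random data for $X,Y$; the finite-difference comparison is my own addition, not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(1)

# The two basis matrices from the text (3x4)
B1 = np.zeros((3, 4)); B1[0, 0] = B1[0, 1] = 1.0
B2 = np.zeros((3, 4)); B2[1, 1] = B2[2, 1] = B2[2, 2] = 1.0

X = rng.standard_normal((4, 5))
Y = rng.standard_normal((3, 5))

def phi(p):
    """Cost phi = (1/2)||AX - Y||_F^2 with A = p1*B1 + p2*B2."""
    A = p[0] * B1 + p[1] * B2
    return 0.5 * np.sum((A @ X - Y) ** 2)

p = np.array([0.7, -1.3])
A = p[0] * B1 + p[1] * B2
G = (A @ X - Y) @ X.T                              # unconstrained gradient
grad_p = np.array([np.sum(G * Bi) for Bi in (B1, B2)])  # dphi/dp_i = G • B_i

# central finite differences for comparison
eps = 1e-6
fd = np.array([(phi(p + eps * np.eye(2)[i]) - phi(p - eps * np.eye(2)[i])) / (2 * eps)
               for i in range(2)])
assert np.allclose(grad_p, fd, atol=1e-5)
```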
At this point, one would do all further calculations
in terms of the $p$-vector.
Now comes the weird part...
Every basis $\{B_i\}$ has a dual basis $\{B_i^\delta\}$ which spans the same subspace $\cal S$, but is orthonormal with respect to the inner product
$$B_i\b B_j^\delta \;=\; \delta_{ij}$$
Some bases are self-dual, such as the canonical vector basis
$\{\e_i\}$, but in general determining the dual basis requires a pseudoinverse calculation
$$\eqalign{
&\;b_k = {\rm vec}(B_k) \qquad &\;b_k^\delta = {\rm vec}(B_k^\delta) \\
&\m{b_1 & b_2 &\ldots & b_p}^+ = &\m{b_1^\delta & b_2^\delta &\ldots & b_p^\delta}^T \\
}$$
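This pseudoinverse recipe can be checked directly. The sketch below (my own, using `ravel` for ${\rm vec}$, which is fine since both sides use the same ordering) builds the dual basis for the $B_1,B_2$ example above and verifies biorthogonality:

```python
import numpy as np

B1 = np.zeros((3, 4)); B1[0, 0] = B1[0, 1] = 1.0
B2 = np.zeros((3, 4)); B2[1, 1] = B2[2, 1] = B2[2, 2] = 1.0

M = np.column_stack([B1.ravel(), B2.ravel()])   # columns are vec(B_k)
D = np.linalg.pinv(M)                           # rows are vec(B_k^delta)
B1d, B2d = D[0].reshape(3, 4), D[1].reshape(3, 4)

# biorthogonality: B_i • B_j^delta = delta_ij
Gram = np.array([[np.sum(Bi * Bjd) for Bjd in (B1d, B2d)]
                 for Bi in (B1, B2)])
assert np.allclose(Gram, np.eye(2))
```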
In the vector case, the gradient with respect to the $p$-vector can be written as the sum of each component multiplied by the corresponding vector from the dual basis, i.e.
$$\eqalign{
\p{\phi}{p}
&= \sum_{i=1}^2 \left(\p{\phi}{p_i}\right)\e_i \\
}$$
Many authors extend this idea and define the structured gradient
as the matrix
$$\eqalign{
\left(\p{\phi}{A}\right)_S
&= \sum_{i=1}^2\left( \p{\phi}{p_i} \right) B_i^\delta \\
&= \sum_{i=1}^2\left(G\b B_i\right) B_i^\delta \\
&= G\b\left(\sum_{i=1}^2 B_i B_i^\delta \right) \\
&= G\b{\cal B} \\
}$$
where $\cal B$ is a fourth-order tensor (so the products $B_i B_i^\delta$ above are outer products, not matrix products) with components
$${\cal B}_{jk\ell m}
= \sum_{i=1}^2\;\left(B_i\right)_{jk}\,\left(B_i^\delta\right)_{\ell m}$$
The $\cal B$ tensor is a projector onto the subspace
$\big(\,{\cal B}\b X\in{\cal S}\;\;{\rm for\;all}\;X\in{\mathbb R}^{3\times 4}\big)$
and it acts as an identity tensor on the subspace
$\big({\cal B}\b M=M\b{\cal B} = M\;\;{\rm for}\;M\in{\cal S}\big)$ .
If the basis spans the whole space $\,{\cal S}\equiv{\mathbb R}^{3\times 4}\,$ then $\cal B$ becomes the true identity tensor $\cal I$, and the structured gradient is identical to the full unstructured gradient $G$
(as expected).
$$\eqalign{
{\cal B}_{jk\ell m} \;&\to\; {\cal I}_{jk\ell m}
= \delta_{j\ell}\delta_{km} \\
(G\b{\cal B}) \;&\to\; (G\b{\cal I}) = G \\
}$$
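The projector and identity properties of $\cal B$ can also be checked numerically with `einsum` (a sketch of my own, using the earlier $3\times 4$ basis; the contraction `jklm,lm->jk` implements ${\cal B}\b X$):

```python
import numpy as np

B1 = np.zeros((3, 4)); B1[0, 0] = B1[0, 1] = 1.0
B2 = np.zeros((3, 4)); B2[1, 1] = B2[2, 1] = B2[2, 2] = 1.0

# dual basis via pseudoinverse of the vectorized basis
M = np.column_stack([B1.ravel(), B2.ravel()])
D = np.linalg.pinv(M)
B1d, B2d = D[0].reshape(3, 4), D[1].reshape(3, 4)

# fourth-order tensor: B_{jklm} = sum_i (B_i)_{jk} (B_i^delta)_{lm}
T = np.einsum('jk,lm->jklm', B1, B1d) + np.einsum('jk,lm->jklm', B2, B2d)

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 4))
PX = np.einsum('jklm,lm->jk', T, X)      # B • X lands in S
PPX = np.einsum('jklm,lm->jk', T, PX)
assert np.allclose(PX, PPX)              # projector: idempotent

M_in_S = 0.7 * B1 - 1.3 * B2             # an element of S
assert np.allclose(np.einsum('jklm,lm->jk', T, M_in_S), M_in_S)  # identity on S
```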
As a concrete example, let's examine a symmetrically constrained $2\times 2$ matrix.
$$\eqalign{
p &= \m{\alpha \\ \beta \\ \lambda},\qquad
B_1 = \m{1 & 0 \\ 0 & 0},\qquad
B_2 = \m{0 & 0 \\ 0 & 1},\qquad
B_3 = \m{0 & 1 \\ 1 & 0} \\
A &= \m{\alpha & \lambda \\ \lambda & \beta}
\quad=\quad \alpha B_1 + \beta B_2 + \lambda B_3,\qquad
B_k^\delta = \frac{B_k}{B_k\b B_k} \\
}$$
Since this basis is mutually orthogonal, each dual element is just a rescaled basis matrix, and the structured gradient calculation goes as follows
$$\eqalign{
\left(\p{\phi}{A}\right)_S
&= \frac{(G\b B_1)B_1}{B_1\b B_1}
+ \frac{(G\b B_2)B_2}{B_2\b B_2}
+ \frac{(G\b B_3)B_3}{B_3\b B_3} \\
&= G_{11}\,B_1 +G_{22}\,B_2 +\tfrac 12(G_{12}+G_{21})\,B_3 \\
&= \m{G_{11} & \tfrac 12(G_{12}+G_{21}) \\ \tfrac 12(G_{12}+G_{21}) & G_{22}} \\
&= \left(\frac{G+G^T}{2}\right) \;\doteq\; {\rm Sym}(G) \\
}$$
But The Matrix Cookbook uses the regular basis instead of the dual basis,
which results in the following miscalculation
$$\eqalign{
\left(\p{\phi}{A}\right)_{S^*}
&= \left(G\b B_1\right)B_1
+ \left(G\b B_2\right)B_2
+ \left(G\b B_3\right)B_3 \\
&= G_{11}\,B_1 +G_{22}\,B_2 +(G_{12}+G_{21})\,B_3 \\
&= \m{G_{11} & (G_{12}+G_{21}) \\ (G_{12}+G_{21}) & G_{22}} \\
&= G+G^T-{\rm Diag}(G) \\
}$$
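Both $2\times 2$ symmetric results are easy to confirm numerically; the sketch below (my own addition) computes the dual-basis gradient and the primal-basis miscalculation from a random $G$:

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.standard_normal((2, 2))

B1 = np.array([[1., 0.], [0., 0.]])
B2 = np.array([[0., 0.], [0., 1.]])
B3 = np.array([[0., 1.], [1., 0.]])

dot = lambda X, Y: np.sum(X * Y)   # bullet / Frobenius inner product

# dual-basis version: B_k^delta = B_k / (B_k • B_k)  ->  Sym(G)
grad_S = sum(dot(G, Bk) / dot(Bk, Bk) * Bk for Bk in (B1, B2, B3))
assert np.allclose(grad_S, 0.5 * (G + G.T))

# primal-basis version (the Cookbook-style result)
grad_Sstar = sum(dot(G, Bk) * Bk for Bk in (B1, B2, B3))
assert np.allclose(grad_Sstar, G + G.T - np.diag(np.diag(G)))
```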
The skew-symmetric case is similar but is seldom mentioned.
There is only one parameter and one matrix in the basis
$$\eqalign{
p &= \m{\alpha},\qquad B = \m{0 & 1 \\ -1 & 0},\qquad
B^\delta = \frac{B}{B\b B} \\
A &= \m{0 & \alpha \\ -\alpha & 0} \;\;=\;\; \alpha B \\
}$$
and the structured gradient is
$$\eqalign{
\left(\p{\phi}{A}\right)_S
&= \frac{(G\b B)B}{B\b B} \\
&= \tfrac 12(G_{12}-G_{21})\,B \\
&= \m{0 & \tfrac 12(G_{12}-G_{21}) \\ \tfrac 12(G_{21}-G_{12}) & 0} \\
&= \left(\frac{G-G^T}{2}\right) \;\doteq\; {\rm Skew}(G) \\
}$$
If you use $B$ instead of $B^\delta$ in this case, the gradient has the right direction but the wrong length, i.e.
$$\eqalign{
\left(\p{\phi}{A}\right)_{S^*}
&= \left(G-G^T\right) \;=\; 2\;{\rm Skew}(G) \\
}$$
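And a matching numerical check (my own sketch) for the skew-symmetric case:

```python
import numpy as np

rng = np.random.default_rng(4)
G = rng.standard_normal((2, 2))
B = np.array([[0., 1.], [-1., 0.]])
dot = lambda X, Y: np.sum(X * Y)

# dual basis: B/(B • B)  ->  Skew(G)
grad_S = dot(G, B) / dot(B, B) * B
assert np.allclose(grad_S, 0.5 * (G - G.T))

# primal basis: right direction, double the length
grad_Sstar = dot(G, B) * B
assert np.allclose(grad_Sstar, G - G.T)
```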
$
\def\p{\partial}
\def\L{\left}\def\R{\right}\def\LR#1{\L(#1\R)}
\def\diag#1{\operatorname{diag}\LR{#1}}
\def\Diag#1{\operatorname{Diag}\LR{#1}}
\def\trace#1{\operatorname{Tr}\LR{#1}}
\def\grad#1#2{\frac{\p #1}{\p #2}}
$For typing convenience, define the matrix variables
$$\eqalign{
U &= U^T = \Diag{u}
\quad&\implies\quad dU = \Diag{du} \\
B &= B^T = \LR{AUA^T}^{-1}
\quad&\implies\quad dB = -B\LR{A\;dU\,A^T}B \\
}$$
and the Frobenius product, which is a convenient notation for the trace, i.e.
$$\eqalign{
A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{AB^T} \\
A:A &= \big\|A\big\|^2_F \\
}$$
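A quick spot-check of both identities with a random matrix (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

frob = lambda X, Y: np.sum(X * Y)   # the A:B product

assert np.isclose(frob(A, B), np.trace(A @ B.T))             # A:B = Tr(AB^T)
assert np.isclose(frob(A, A), np.linalg.norm(A, 'fro') ** 2)  # A:A = ||A||_F^2
```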
Write the objective function using this notation. Then calculate the differential and gradient.
$$\eqalign{
s &= x^TBx \\&= xx^T:B \\
ds &= xx^T:dB \\
&= -xx^T:B{A\;dU\,A^T}B \\
&= -{A^TBxx^TBA}:\Diag{du} \\
&= -\diag{A^TBxx^TBA}:{du} \\
\grad{s}{u} &= -\diag{A^TBxx^TBA} \\\\
}$$
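A finite-difference check of this gradient (my own addition, with random $A$ and $x$ and a positive $u$ so that $AUA^T$ stays invertible):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 3, 5
A = rng.standard_normal((m, n))
x = rng.standard_normal(m)
u = rng.uniform(1.0, 2.0, n)     # positive entries keep A U A^T well-conditioned

def s(u):
    """s = x^T (A Diag(u) A^T)^{-1} x"""
    Bmat = np.linalg.inv(A @ np.diag(u) @ A.T)
    return x @ Bmat @ x

# closed-form gradient: -diag(A^T B x x^T B A) = -(A^T B x)^2 elementwise
Bmat = np.linalg.inv(A @ np.diag(u) @ A.T)
w = A.T @ Bmat @ x
grad = -(w * w)

eps = 1e-6
fd = np.array([(s(u + eps * np.eye(n)[i]) - s(u - eps * np.eye(n)[i])) / (2 * eps)
               for i in range(n)])
assert np.allclose(grad, fd, atol=1e-5)
```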
The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many different ways, e.g.
$$\eqalign{
A:B &= B:A \\
A:B &= A^T:B^T \\
CA:B &= C:BA^T = A:C^TB \\
}$$
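These rearrangement rules can likewise be spot-checked with random matrices of compatible shapes (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((4, 3))   # so that CA, BA^T, C^T B all conform
B = rng.standard_normal((5, 3))
C = rng.standard_normal((5, 4))

frob = lambda X, Y: np.sum(X * Y)

assert np.isclose(frob(C @ A, B), frob(B, C @ A))            # A:B = B:A
assert np.isclose(frob(C @ A, B), frob((C @ A).T, B.T))      # A:B = A^T:B^T
assert np.isclose(frob(C @ A, B), frob(C, B @ A.T))          # CA:B = C:BA^T
assert np.isclose(frob(C @ A, B), frob(A, C.T @ B))          # CA:B = A:C^T B
```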
Unlike the Chain Rule, the differential approach requires no higher-order tensors like $\grad{M}{u}$ or $\grad{M^{-1}}{M}$. Further, the differential $dB$ obeys the same rules of algebra as the matrix $B$, making it easy to manipulate; higher-order tensors do not.
Best Answer
The easiest way to calculate the derivatives of matrix-valued functions is to go back to the definition of the derivative in the usual sense of limits. So if
$$f(A,B,C) = (A + BC)^{T}(A+BC),$$
then for some small $t > 0$ and some $E \in M_{3\times 1}(\mathbb{R})$, we calculate the directional derivative in the "direction" of $E$:
$$\eqalign{
\frac{\partial f(A,B,C)}{\partial C}(E)
&= \lim_{t\rightarrow 0}\frac{f(A,B,C+tE) - f(A,B,C)}{t} \\
&= \lim_{t\rightarrow 0}\frac{1}{t}\Big[\big(A + B(C+tE)\big)^{T}\big(A+B(C+tE)\big) - (A + BC)^{T}(A+BC)\Big] \\
&= \lim_{t\rightarrow 0}\frac{1}{t}\Big[(A + BC)^{T}(A+BC) + t(BE)^{T}(A+BC) + t(A+BC)^{T}(BE) \\
&\qquad\qquad\quad +\, t^{2}(BE)^{T}(BE) - (A+BC)^{T}(A+BC)\Big] \\
&= \lim_{t\rightarrow 0}\Big[(BE)^{T}(A+BC) + (A+BC)^{T}(BE) + t(BE)^{T}(BE)\Big] \\
&= (BE)^{T}(A+BC) + (A+BC)^{T}(BE),
}$$
i.e.
$$\frac{\partial f(A,B,C)}{\partial C}(E) = (BE)^{T}(A+BC) + (A+BC)^{T}(BE),$$
which is similar to your answer, except that now the $3\times 1$ matrix $E \in M_{3 \times 1}(\mathbb{R})$ sorts out the dimension problem you were having.
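The closed-form directional derivative can be verified against a finite difference (a sketch of my own, with random conforming matrices):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 1))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 1))
E = rng.standard_normal((3, 1))   # the "direction" of differentiation

f = lambda C_: (A + B @ C_).T @ (A + B @ C_)

# closed-form directional derivative from the answer
Df = (B @ E).T @ (A + B @ C) + (A + B @ C).T @ (B @ E)

# central finite difference in the direction E
t = 1e-6
fd = (f(C + t * E) - f(C - t * E)) / (2 * t)
assert np.allclose(Df, fd, atol=1e-5)
```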