Unit quaternions are great for parameterizing rotations in 3-D space, but estimating them directly in a conventional Kalman filter setting is tricky. Unit quaternions are constrained to live on the unit sphere in 4-D space ($S^3 \subset \mathbb{R}^4$), so their probability density function (pdf) is restricted to the surface of that sphere. If one uses a Gaussian distribution to parameterize the pdf (as a Kalman filter does), the expectation conditioned on the measurements will lie inside the unit sphere and hence, by definition, will not be a unit quaternion. In addition, the covariance matrix shrinks in the directions orthogonal to the surface of the sphere, which leads to a singular covariance matrix after several updates. This conceptual problem is explained in more detail in the references linked below.
In order to circumvent this estimation problem, a common engineering practice is to represent the true orientation ($\pmb{q}$) as a small deviation from a reference orientation ($\bar{\pmb{q}}$) as:
$$ \pmb{q} = \bar{\pmb{q}} \oplus \pmb{\delta} (\pmb{e}) $$
The deviation $\pmb{\delta} \in S^3$ can be approximately parameterized by an error vector $\pmb{e} \in \mathbb{R}^3$ as:
$$ \pmb{\delta} \approx \begin{bmatrix} 1 & \frac{\pmb{e}}{2}\end{bmatrix}^T $$
For small orientation deviations, this approximation is accurate up to second order. The idea is then to estimate the error vector $\hat{\pmb{e}}$ within the Kalman filter while simultaneously and separately propagating the reference quaternion by numerically integrating:
$$\dot{\bar{\pmb{q}}} = \frac{1}{2} \cdot \bar{\pmb{q}} \oplus \begin{bmatrix} 0 \\ \bar{\pmb{\omega}} \end{bmatrix} $$
If we assume that the reference angular velocity ($\bar{\pmb{\omega}}$) remains constant over the sample interval, the discrete equivalent of this differential equation is:
$$ \bar{\pmb{q}}_k = \bar{\pmb{q}}_{k-1} \oplus \left[
\begin{matrix}
\cos\left(||\bar{\pmb{\omega}}_{k-1}|| \frac{\Delta t}{2}\right) \\
\frac{\bar{\pmb{\omega}}_{k-1}}{||\bar{\pmb{\omega}}_{k-1}||} \sin\left(||\bar{\pmb{\omega}}_{k-1}||\frac{\Delta t}{2}\right)
\end{matrix}
\right]
$$
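As an illustration, the discrete propagation step above can be sketched in NumPy (the function and variable names here are my own, not from any particular library; quaternions are stored scalar-first as $[w, x, y, z]$):

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product a ⊕ b for quaternions stored as [w, x, y, z]."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def propagate(q_bar, omega_bar, dt):
    """One discrete step of the reference-quaternion propagation,
    assuming omega_bar (rad/s) is constant over the sample interval dt."""
    angle = np.linalg.norm(omega_bar) * dt / 2.0
    if np.isclose(angle, 0.0):
        return q_bar  # negligible rotation over this step
    axis = omega_bar / np.linalg.norm(omega_bar)
    dq = np.concatenate(([np.cos(angle)], axis * np.sin(angle)))
    q = quat_mul(q_bar, dq)
    return q / np.linalg.norm(q)  # renormalize to guard against numeric drift
```

The explicit renormalization at the end is not part of the closed-form solution; it simply keeps floating-point round-off from accumulating over many steps.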
The propagation dynamics of the error state can be shown to be approximately linear:
$$\dot{\pmb{e}} = \pmb{F}\pmb{e} + \pmb{G}\pmb{\eta}$$
where
- $\pmb{\eta} = \pmb{\omega} - \bar{\pmb{\omega}}$ is the angular-velocity error, assumed to be a white-noise process with spectral density matrix $\pmb{Q}$
- $\pmb{F} = - \left[ \bar{\pmb{\omega}} \times \right]$, the skew-symmetric cross-product matrix
- $\pmb{G} = \pmb{I}$
Derivations of the propagation dynamics and the matrices $\pmb{F}$ and $\pmb{G}$ can be found in the references given below.
The covariance propagation equation is:
$$\dot{\pmb{P}}_e = \pmb{F}\pmb{P}_e + \pmb{P}_e\pmb{F}^T + \pmb{G}\pmb{Q}\pmb{G}^T$$
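A minimal sketch of one covariance-propagation step, using a simple explicit-Euler discretization of the equation above (the discretization choice and all names are mine; a real filter would typically use a matrix exponential or Van Loan's method for better accuracy):

```python
import numpy as np

def skew(w):
    """Cross-product (skew-symmetric) matrix [w×]."""
    return np.array([[0.0, -w[2],  w[1]],
                     [w[2],  0.0, -w[0]],
                     [-w[1], w[0],  0.0]])

def propagate_covariance(P, omega_bar, Q, dt):
    """One explicit-Euler step of  Ṗ = F P + P Fᵀ + G Q Gᵀ
    with F = -[ω̄×] and G = I.  dt must be small relative to 1/||ω̄||."""
    F = -skew(omega_bar)
    Pdot = F @ P + P @ F.T + Q
    return P + Pdot * dt
```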
It is also worth noting that when $\pmb{e} = \pmb{0}$, then $\pmb{\delta} (\pmb{e})$ is the identity quaternion. Thus, after each measurement update, the error vector $\pmb{e}$ can be reset to zero by updating the reference quaternion as:
$$\bar{\pmb{q}}^+_k = \bar{\pmb{q}}^-_k \oplus \pmb{\delta} (\hat{\pmb{e}}_k)$$
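The reset step can be sketched as follows (a sketch under my own naming; the helper `quat_mul` is a plain Hamilton product, and the small-angle $\pmb{\delta}(\hat{\pmb{e}})$ is normalized before the product to keep the reference quaternion on $S^3$):

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product for quaternions stored as [w, x, y, z]."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def reset_error(q_bar, e_hat):
    """Fold the estimated error vector into the reference quaternion,
    q⁺ = q̄ ⊕ δ(ê) with δ(ê) ≈ [1, ê/2]ᵀ, then reset the error to zero."""
    delta = np.concatenate(([1.0], e_hat / 2.0))
    delta /= np.linalg.norm(delta)          # project back onto S³
    q = quat_mul(q_bar, delta)
    return q / np.linalg.norm(q), np.zeros(3)
```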
Hope this helps!
References:
- Kalman Filtering for Attitude Estimation with Quaternions and Concepts from Manifold Theory
- Attitude Error Representations for Kalman Filtering
$
\def\o{{\tt1}}
\def\p{\partial}
\def\LR#1{\left(#1\right)}
\def\op#1{\operatorname{#1}}
\def\trace#1{\op{Tr}\LR{#1}}
\def\qiq{\quad\implies\quad}
\def\grad#1#2{\frac{\p #1}{\p #2}}
\def\c#1{\color{red}{#1}}
\def\gradLR#1#2{\LR{\grad{#1}{#2}}}
$As you have discovered, the chain rule can be difficult to apply in Matrix Calculus because it involves higher-order tensors (i.e. matrix-by-vector, vector-by-matrix, and matrix-by-matrix gradients) which are difficult to calculate, awkward to manipulate, and don't fit into standard matrix notation.
Instead I would recommend a differential approach, because the differential of a matrix behaves like a matrix. In particular, it can be written using standard matrix notation and it obeys all of the rules of matrix algebra.
Also, the $\c{\rm Frobenius}$ product is extraordinarily useful in Matrix Calculus
$$\eqalign{
A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\
A:A &= \|A\|^2_F \qquad \big\{{\rm \c{Frobenius}\;norm}\big\}\\
}$$
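These identities are easy to verify numerically; here is a quick NumPy check on arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

# Frobenius product A:B as an elementwise product, summed
frob = lambda X, Y: (X * Y).sum()

# A:B equals Tr(AᵀB)
assert np.isclose(frob(A, B), np.trace(A.T @ B))
# A:A equals the squared Frobenius norm
assert np.isclose(frob(A, A), np.linalg.norm(A, 'fro')**2)
```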
The properties of the underlying trace function allow the terms in such a
product to be rearranged in many equivalent ways, e.g.
$$\eqalign{
A:B &= B:A \\
A:B &= A^T:B^T \\
C:\LR{AB} &= \LR{CB^T}:A \;=\; \LR{A^TC}:B \\\\
}$$
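The rearrangement rules can likewise be checked numerically on random rectangular matrices (shapes chosen so all the products are conformable):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((3, 5))

frob = lambda X, Y: np.trace(X.T @ Y)   # A:B = Tr(AᵀB)

# C:(AB) = (CBᵀ):A = (AᵀC):B
lhs = frob(C, A @ B)
assert np.isclose(lhs, frob(C @ B.T, A))
assert np.isclose(lhs, frob(A.T @ C, B))
# transpose rule:  A:B = Aᵀ:Bᵀ
assert np.isclose(frob(C, C), frob(C.T, C.T))
```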
Using the above notation, the manipulation of your particular function becomes almost mechanical
$$\eqalign{
df &= \gradLR fZ:dZ \\
&= \gradLR fZ:\LR{dX^TY} \\
&= \gradLR fZ Y^T:dX^T \\
&= Y\gradLR fZ^T:dX \\
\grad fX &= Y\gradLR fZ^T \\
}$$
Note that there was no need for a tensor-valued gradient in any step, just plain old matrices.
Also note that your initial dimension-matching approach was correct! That's not as crazy an idea as it seems. When dealing with rectangular matrices there is often only one way to fit all the pieces together and it's a useful shortcut. But it won't help when dealing with square matrices.
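As a concrete sanity check of the final result, take the hypothetical choice $f(Z)=\tfrac12\|Z\|_F^2$ with $Z = X^TY$ (so that $\partial f/\partial Z = Z$) and compare $\frac{\partial f}{\partial X} = Y\left(\frac{\partial f}{\partial Z}\right)^T$ against a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 4))
Y = rng.standard_normal((3, 5))

# f(X) = ½||Z||²_F with Z = XᵀY, so ∂f/∂Z = Z and the formula above
# gives ∂f/∂X = Y (∂f/∂Z)ᵀ = Y Zᵀ
f = lambda X: 0.5 * np.linalg.norm(X.T @ Y, 'fro')**2
grad = Y @ (X.T @ Y).T

# central-difference approximation of the gradient, entry by entry
eps = 1e-6
num = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        dX = np.zeros_like(X)
        dX[i, j] = eps
        num[i, j] = (f(X + dX) - f(X - dX)) / (2 * eps)

assert np.allclose(num, grad, atol=1e-4)
```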
Best Answer
Define the $4\times 1$ state vector
$$\eqalign{ s = \pmatrix{p\\v} }$$
and $4\times 2$ matrix analogs $\big\{E_k\big\}$ of the $2\times 1$ cartesian basis vectors $\{e_k\}$
$$\eqalign{ e_1 &= \pmatrix{{\tt1}\\0},\qquad &e_2 &= \pmatrix{0\\{\tt1}},\qquad &I_2&=e_1e_1^T+e_2e_2^T \\ E_1 &= \pmatrix{I_2\\0_2},\qquad &E_2 &= \pmatrix{0_2\\I_2},\qquad &I_4 &= E_1E_1^T+E_2E_2^T \\ I_2 &= E_1^TE_1,\qquad &I_2 &= E_2^TE_2,\qquad &0_2 &= E_1^TE_2 = E_2^TE_1 \\ p &= E_1^Ts,\qquad &v &= E_2^Ts,\qquad &I_4 &= \frac{\partial s}{\partial s} \\ }$$
Write the primed quantities in terms of $s$
$$\eqalign{ p' &= E_1^Ts' &= \frac{H_1E_1^Ts+h_2}{h_3^TE_1^Ts+h_4} \\ v' &= E_2^Ts' &= \frac{(h_3^TE_1^Ts+h_4) H_{1}E_2^Ts-(H_1E_1^Ts+h_2) h_3^TE_2^Ts}{(h_3^TE_1^Ts+h_4)^2} }$$
Multiply the first equation by $E_1$, the second by $E_2$, and add them together to obtain
$$\eqalign{ s' &= \frac{E_1H_1E_1^Ts+E_1h_2}{h_3^TE_1^Ts+h_4} + \frac{(h_3^TE_1^Ts+h_4)E_2H_1E_2^Ts-(E_2H_1E_1^Ts+E_2h_2) h_3^TE_2^Ts}{(h_3^TE_1^Ts+h_4)^2} \\ F &= \frac{\partial s'}{\partial s} \;=\; \ldots \\ }$$
I'll leave the tedious details of that last calculation to you.
Update
To keep things tidy define
$$\eqalign{ w_{jk} &= E_jh_k \quad &&M_{jk} = E_jH_1E_k^T \qquad M = M_{11} + M_{22} \\ \alpha &= w_{13}^Ts+h_4 &\implies\quad&\frac{\partial\alpha}{\partial s} = w_{13}^T \\ \beta &= w_{23}^Ts &\implies&\frac{\partial\beta}{\partial s} = w_{23}^T \\ }$$
Then calculate
$$\eqalign{ s'&= \alpha^{-1} (Ms+w_{12}) - \alpha^{-2}\beta(M_{21}s+w_{22}) \\ \\ \frac{\partial s'}{\partial s} &= \alpha^{-1} \big(M\big) \\ &-\; \alpha^{-2} \Big( (Ms + w_{12})w_{13}^T + \beta M_{21} + (M_{21}s + w_{22})w_{23}^T\Big) \\ &+\; 2\alpha^{-3}\beta\Big((M_{21}s + w_{22})w_{13}^T\Big) \\ }$$
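Those formulas are easy to sanity-check with finite differences; a NumPy sketch (all numeric values below are arbitrary test data of my own choosing, with $h_4$ kept large so $\alpha$ stays away from zero):

```python
import numpy as np

rng = np.random.default_rng(3)
H1 = rng.standard_normal((2, 2))
h2, h3 = rng.standard_normal(2), rng.standard_normal(2)
h4 = 5.0
s = rng.standard_normal(4)

# E_1, E_2: 4×2 analogs of the cartesian basis vectors
E1 = np.vstack([np.eye(2), np.zeros((2, 2))])
E2 = np.vstack([np.zeros((2, 2)), np.eye(2)])

w12, w13 = E1 @ h2, E1 @ h3
w22, w23 = E2 @ h2, E2 @ h3
M21 = E2 @ H1 @ E1.T
M = E1 @ H1 @ E1.T + E2 @ H1 @ E2.T   # M = M_11 + M_22

def s_prime(s):
    a = w13 @ s + h4                   # alpha
    b = w23 @ s                        # beta
    return (M @ s + w12) / a - b * (M21 @ s + w22) / a**2

# Jacobian F = ∂s'/∂s from the closed-form expression above
a = w13 @ s + h4
b = w23 @ s
F = (M / a
     - (np.outer(M @ s + w12, w13) + b * M21 + np.outer(M21 @ s + w22, w23)) / a**2
     + 2 * b * np.outer(M21 @ s + w22, w13) / a**3)

# central-difference check, column by column
eps = 1e-6
num = np.zeros((4, 4))
for j in range(4):
    d = np.zeros(4)
    d[j] = eps
    num[:, j] = (s_prime(s + d) - s_prime(s - d)) / (2 * eps)

assert np.allclose(num, F, atol=1e-5)
```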