The derivative of a function with respect to a matrix (or vector) doesn’t have the same size as the original input.

chain-rule, derivatives, linear-algebra, matrices, matrix-calculus

I have a scenario where I am given historical data for $m$ assets, with $S \in \mathbb{R}^{(T+1)\times m}$ the matrix of asset values over time. I want to find the derivative of the following function $f$ with respect to the vector $w$:
$$f(w)=\frac{1}{2} \ln(a)+\frac{1}{2}$$
where
$$a=\frac{1}{T}\left\| b_1-cb_0\right\|^{2},$$
$$b_1=BS_{1:T}w,$$
$$b_0=BS_{0:T-1}w,$$
$$B=\textbf{I}-\frac{\textbf{11}^T}{T},$$
$$c=\frac{b_{0}^{T}b_1}{\left\|b_0 \right\|^{2}}$$
These equations come from this paper, btw.
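For concreteness, here is how I set these quantities up numerically (the sizes and random data below are just illustrative, not from the paper):

```python
import numpy as np

T, m = 50, 4                          # illustrative sizes
rng = np.random.default_rng(0)
S = rng.standard_normal((T + 1, m))   # asset values over time, (T+1) x m
w = rng.standard_normal(m)            # weight vector, m x 1

B = np.eye(T) - np.ones((T, T)) / T   # B = I - 11^T / T (centering matrix)
b1 = B @ S[1:] @ w                    # b_1 = B S_{1:T} w
b0 = B @ S[:-1] @ w                   # b_0 = B S_{0:T-1} w
c = (b0 @ b1) / (b0 @ b0)             # c = b_0^T b_1 / ||b_0||^2
a = np.linalg.norm(b1 - c * b0) ** 2 / T
f = 0.5 * np.log(a) + 0.5
print(b0.shape, b1.shape, f)
```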

I tried it like this:
$$\begin{aligned}\frac{df}{dw}
&=\frac{1}{2a} \times \frac{d}{dw}a \\
&=\frac{1}{2a} \times \frac{d}{dw} \left (\frac{1}{T}\left\| b_1-cb_0\right\|^{2} \right) \\
&=\frac{1}{2Ta} \times \frac{d}{dw} \left (\left\| b_1-cb_0\right\|^{2} \right) \\
&=\frac{1}{Ta} \left(\frac{d}{dw}\left (b_1-cb_0 \right )\right)^T \left (b_1-cb_0 \right )
\end{aligned}$$

The result of this derivation should be a vector of the same size as $w$, which is $m \times 1$, right?

The last factor $\left(b_1-cb_0 \right)$ is a vector of size $T \times 1$, so $\frac{d}{dw}\left (b_1-cb_0 \right)$ should be a matrix of size $T \times m$.

I tried to calculate $\frac{d}{dw}\left (b_1-cb_0 \right )$ like this:
$$\begin{aligned} \frac{d}{dw}\left (b_1-cb_0 \right )
&=\frac{d}{dw}b_1 - \frac{d}{dw} \left(cb_0\right) \\
&=\frac{d}{dw}b_1 - \left(\frac{d}{dw}c\right)b_0 - c \frac{d}{dw} b_0 \\
&= BS_{1:T} - \left(\frac{d}{dw}c\right)b_0 - cBS_{0:T-1}
\end{aligned}$$

$BS_{1:T}$ and $cBS_{0:T-1}$ are matrices of size $T \times m$, but I can't tell what size $\frac{d}{dw}c$ should be, since $b_0$ is a vector of size $T \times 1$. Can anyone kindly tell me where the mistake is?

Best Answer

$ \def\a{\alpha}\def\b{\beta}\def\g{c}\def\t{\tau} \def\o{{\tt1}}\def\p{\partial} \def\LR#1{\left(#1\right)} \def\BR#1{\big(#1\big)} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} $For typing convenience, abbreviate $S_{0:T-1}$ as $S_0$ and $S_{1:T}$ as $S_1$, and define the variables
$$\eqalign{
B_0 &= BS_{0}, \quad b_0 = B_0w, \quad db_0 = B_0\,dw \\
B_1 &= BS_{1}, \quad b_1 = B_1w, \quad db_1 = B_1\,dw \\
x &= \LR{\g b_0-b_1} = \LR{\g B_0-B_1}w \\
\a &= \|x\|^2_F = Ta \\
g &= \fracLR{{B_1^Tb_0}+{B_0^Tb_1}-2\g\,B_0^Tb_0}{b_0^Tb_0} \\
}$$
as well as the Frobenius product, which is a convenient notation for the trace
$$\eqalign{
A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\
A:A &= \|A\|^2_F \\
}$$
The properties of the underlying trace function allow the terms in such a product to be rearranged in many different but equivalent ways, e.g.
$$\eqalign{
A:B &= B:A \\
A:B &= A^T:B^T \\
C:\LR{AB} &= \LR{CB^T}:A \\&= \LR{A^TC}:B \\
}$$
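These rearrangement identities are easy to verify numerically; a quick sketch with random matrices (shapes chosen only so the products conform):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
Bm = rng.standard_normal((3, 4))
C = rng.standard_normal((3, 5))
D = rng.standard_normal((4, 5))   # so that A @ D has the shape of C

def frob(X, Y):
    """Frobenius product  X : Y  =  Tr(X^T Y)  =  sum_ij X_ij Y_ij."""
    return np.trace(X.T @ Y)

assert np.isclose(frob(A, Bm), np.sum(A * Bm))        # A:B = sum_ij A_ij B_ij
assert np.isclose(frob(A, Bm), frob(Bm, A))           # A:B = B:A
assert np.isclose(frob(A, Bm), frob(A.T, Bm.T))       # A:B = A^T:B^T
assert np.isclose(frob(C, A @ D), frob(C @ D.T, A))   # C:(AB) = (CB^T):A
assert np.isclose(frob(C, A @ D), frob(A.T @ C, D))   # C:(AB) = (A^T C):B
print("all identities hold")
```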


Use the above notation to calculate the desired gradient.

First, calculate the differential of $\g$
$$\eqalign{
\g &= \frac{b_0:b_1}{b_0:b_0} \\
d\g &= \fracLR{\LR{b_0:db_1}+\LR{b_1:db_0}}{b_0:b_0} - \fracLR{\LR{b_0:b_1}\LR{2\,{b_0:db_0}}}{(b_0:b_0)^2} \\
&= \frac{\LR{b_0:db_1}+\LR{b_1:db_0}-2\g\,b_0:db_0}{b_0:b_0} \\
&= \frac{\LR{b_0:B_1dw}+\LR{b_1:B_0dw}-2\g\,b_0:B_0dw}{b_0:b_0} \\
&= \fracLR{{B_1^Tb_0}+{B_0^Tb_1}-2\g\,B_0^Tb_0}{b_0^Tb_0}:dw \\
&= g^Tdw \\
}$$
then the differential of $\a$
$$\eqalign{
\a &= {x:x} \\
d\a &= {2x:dx} \\
&= 2x:\BR{b_0\,\c{dc} +\g B_0\,dw -B_1\,dw} \\
&= 2x:\LR{B_0w\c{g^T} +\g B_0 -B_1}\c{dw} \\
&= 2\BR{B_0wg^T +\g B_0 -B_1}^T\c{x}:dw \\
&= 2\BR{B_0wg^T +\g B_0 -B_1}^T\c{\LR{\g B_0-B_1}w}:dw \\
}$$
and finally the differential and gradient of $f$
$$\eqalign{
f &= \frac 12\LR{\log\LR{\frac{\a}{T}}+\o} \\
&= \frac 12\BR{\log(\a)+\o-\log(T)} \\
df &= \frac 12\LR{\frac{d\a}{\a}} \\
&= \fracLR{\LR{B_0wg^T+\g B_0-B_1}^T\LR{\g B_0-B_1}w}{\a}:dw \\
&= \fracLR{\LR{S_0wg^T+\g S_0-S_1}^TB^TB\LR{\g S_0-S_1}w}{\a}:dw \\
\grad{f}{w} &= \frac{\LR{S_0wg^T+\g S_0-S_1}^TB\LR{\g S_0-S_1}w}{Ta} \\
}$$
where the last two lines use the facts that $B^2=B=B^T$ and $\a = Ta$.
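As a sanity check, the closed-form gradient can be compared against a finite-difference approximation. The sizes, seed, and variable names below are illustrative assumptions:

```python
import numpy as np

T, m = 60, 5
rng = np.random.default_rng(2)
S = rng.standard_normal((T + 1, m))
w = rng.standard_normal(m)

Bc = np.eye(T) - np.ones((T, T)) / T      # B = I - 11^T/T (symmetric, idempotent)
S0, S1 = S[:-1], S[1:]                    # S_{0:T-1} and S_{1:T}

def f(w):
    b0, b1 = Bc @ S0 @ w, Bc @ S1 @ w
    c = (b0 @ b1) / (b0 @ b0)
    a = np.linalg.norm(b1 - c * b0) ** 2 / T
    return 0.5 * np.log(a) + 0.5

# analytic gradient:  (S0 w g^T + c S0 - S1)^T B (c S0 - S1) w / (T a)
b0, b1 = Bc @ S0 @ w, Bc @ S1 @ w
B0, B1 = Bc @ S0, Bc @ S1
c = (b0 @ b1) / (b0 @ b0)
a = np.linalg.norm(b1 - c * b0) ** 2 / T
g = (B1.T @ b0 + B0.T @ b1 - 2 * c * B0.T @ b0) / (b0 @ b0)
grad = (S0 @ np.outer(w, g) + c * S0 - S1).T @ Bc @ (c * S0 - S1) @ w / (T * a)

# central finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(m)])
print(np.max(np.abs(grad - fd)))
```

The two gradients agree to within finite-difference error, which also confirms that the result has the same size as $w$, i.e. $m \times 1$.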