Derivative of quadratic form of vector-valued function

derivativelinear algebramatrixmultivariate analysis

This seems like a trivial question but I am currently stuck and cannot see what I am doing wrong.

So let us consider a function $f(x) : \mathbb{R}^d \rightarrow \mathbb{R}^d$.

I want to compute the derivative w.r.t. $x \in \mathbb{R}^d$ of an expression that contains a quadratic form of $f(x)$

$$I = f(x)^{\top} C f(x) . $$

Here $C$ is a $d\times d$ matrix.

By taking the derivative w.r.t to the vector $x$ we have

$$ \frac{\partial I}{\partial x} = 2C f(x) \cdot \nabla f(x), $$
where $\nabla f(x)$ denotes the Jacobian of $f$ which will be a $d \times d$ matrix.

Now my problem is that the dimensions of the matrices in the last expression do not match:
We have

$C: d\times d$,
$f(x): d\times 1$, and
$\nabla f(x): d \times d$.

So the last two dimensions do not add up.
What I am doing wrong? Is the correct derivative $$ \frac{\partial I}{\partial x} = \nabla f(x) 2 C f(x) , $$ or $$ \frac{\partial I}{\partial x} = ( 2 C f(x) )^{\top} \cdot \nabla f(x) $$

Best Answer

Let $(\nabla f)_{ik}=\frac{\partial f_i}{\partial x_k}$. Then derivative is

$$ \frac{\partial I}{\partial x} = (\nabla f)^T C f + (\nabla f)^T C^T f. $$

I think you perhaps implicitly assumed that $C$ is symmetric

Related Solutions

Solved – How to take derivative of multivariate normal density

In chapter 2 of the Matrix Cookbook there is a nice review of matrix calculus stuff that gives a lot of useful identities that help with problems one would encounter doing probability and statistics, including rules to help differentiate the multivariate Gaussian likelihood.

If you have a random vector ${\boldsymbol y}$ that is multivariate normal with mean vector ${\boldsymbol \mu}$ and covariance matrix ${\boldsymbol \Sigma}$, then use equation (86) in the matrix cookbook to find that the gradient of the log likelihood ${\bf L}$ with respect to ${\boldsymbol \mu}$ is

$$\begin{align} \frac{ \partial {\bf L} }{ \partial {\boldsymbol \mu}} &= -\frac{1}{2} \left( \frac{\partial \left( {\boldsymbol y} - {\boldsymbol \mu} \right)' {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y} - {\boldsymbol \mu}\right) }{\partial {\boldsymbol \mu}} \right) \nonumber \\ &= -\frac{1}{2} \left( -2 {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y} - {\boldsymbol \mu}\right) \right) \nonumber \\ &= {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y} - {\boldsymbol \mu} \right) \end{align}$$

I'll leave it to you to differentiate this again and find the answer to be $-{\boldsymbol \Sigma}^{-1}$.

As "extra credit", use equations (57) and (61) to find that the gradient with respect to ${\boldsymbol \Sigma}$ is

$$ \begin{align} \frac{ \partial {\bf L} }{ \partial {\boldsymbol \Sigma}} &= -\frac{1}{2} \left( \frac{ \partial \log(|{\boldsymbol \Sigma}|)}{\partial{\boldsymbol \Sigma}} + \frac{\partial \left( {\boldsymbol y} - {\boldsymbol \mu}\right)' {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y}- {\boldsymbol \mu}\right) }{\partial {\boldsymbol \Sigma}} \right)\\ &= -\frac{1}{2} \left( {\boldsymbol \Sigma}^{-1} - {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y} - {\boldsymbol \mu} \right) \left( {\boldsymbol y} - {\boldsymbol \mu} \right)' {\boldsymbol \Sigma}^{-1} \right) \end{align} $$

I've left out a lot of the steps, but I made this derivation using only the identities found in the matrix cookbook, so I'll leave it to you to fill in the gaps.

I've used these score equations for maximum likelihood estimation, so I know they are correct :)

Solved – Derivative of Softmax with respect to weights

The last hidden layer produces output values forming a vector $\vec x = \mathbf x$. The output neuronal layer is meant to classify among $K=1,\dots,k$ categories with a SoftMax activation function assigning conditional probabilities (given $\mathbf x$) to each one the $K$ categories. In each node in the final (or ouput) layer the pre-activated values (logit values) will consist of the scalar products $\mathbf{w}_j^\top\mathbf{x}$, where $\mathbf w_j\in\{\mathbf{w}_1, \mathbf{w}_2,\dots,\mathbf{w}_k\}$. In other words, each category, $k$ will have a different vector of weights pointing at it, determining the contribution of each element in the output of the previous layer (including a bias), encapsulated in $\mathbf x$. However, the activation of this final layer will not take place element-wise (as for example with a sigmoid function in each neuron), but rather through the application of a SoftMax function, which will map a vector in $\mathbb R^k$ to a vector of $K$ elements in [0,1]. Here is a made-up NN to classify colors:

Defining the softmax as

$$ \sigma(j)=\frac{\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)}=\frac{\exp(z_j)}{\sum_{k=1}^K \exp(z_k)}$$

We want to get the partial derivative with respect to a vector of weights $(\mathbf w_i)$, but we can first get the derivative of $\sigma(j)$ with respect to the logit, i.e. $z_i = \mathbf w_i^\top \cdot \mathbf x$:

$$\begin{align} \small{\frac{\partial}{\partial( \mathbf{w}_i^\top \mathbf x)}}\sigma(j) &= \small{\frac{\partial}{\partial \left(\mathbf{w}_i^\top \mathbf x\right)}}\;\frac{\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)} \\[2ex] &\underset{*}{=} \frac{\frac{\partial}{\partial (\mathbf{w_i\top \mathbf x)}}\,\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)}\,-\,\frac{\exp(\mathbf w_j^\top \mathbf x)}{\left(\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x) \right)^2}\quad\small{{\frac{\partial}{\partial \left(\mathbf w_i^\top\mathbf x\right)}}}\,\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)\\[2ex] &= \frac{\delta_{ij}\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)}\,-\,\frac{\exp(\mathbf w_j^\top \mathbf x)}{ \sum_{k=1}^K \exp\left(\mathbf{w}_k^\top\mathbf x \right)} \frac{\exp(\mathbf{w}_i^\top\mathbf x)}{\sum_{k=1}^K \exp\left(\mathbf{w}_k^\top\mathbf x \right)} \\[3ex] &=\sigma(j)\left(\delta_{ij}-\sigma(i)\right) \end{align}$$

$* \text{- quotient rule}$

Thanks and (+1) to Yuntai Kyong for pointing out that there was a forgotten index in the prior version of the post, and the changes in the denominator of the softmax had been left out of the following chain rule...

By the chain rule,

$$\begin{align}\frac{\partial}{\partial \mathbf{w}_i}\sigma(j)&= \sum_{k = 1}^K \frac{\partial}{\partial (\mathbf{w}_k^\top \mathbf x)}\sigma(j)\quad \frac{\partial}{\partial\mathbf{w}_i}\mathbf{w}_k^\top \mathbf{x}\\[2ex] &=\sum_{k = 1}^K \frac{\partial}{\partial (\mathbf{w}_k^\top \mathbf x)}\;\sigma(j)\quad \delta_{ik} \mathbf{x}\\[2ex] &=\sum_{k = 1}^K\sigma(j)\left(\delta_{kj}-\sigma(k)\right)\quad \delta_{ik} \mathbf{x} \end{align}$$

Combining this result with the previous equation:

$$\bbox[8px, border: 2px solid lime]{\frac{\partial}{\partial \mathbf{w}_i}\sigma(j)=\sigma(j)\left(\delta_{ij}-\sigma(i)\right)\mathbf x}$$

Best Answer

Related Solutions

Solved – How to take derivative of multivariate normal density

Solved – Derivative of Softmax with respect to weights

Related Question