[Math] Partial derivative of a function of a matrix

calculus, partial derivative

I have this function of a matrix and two vectors:

$$f(x, V, y) = x^{\top} V y$$ where $$V \in \mathbb{R}^{d \times d}, \quad x \in \mathbb{R}^{d}, \quad y \in \mathbb{R}^{d}$$

I need to minimize this function with respect to $V$, $x$, and $y$. So $V$ is a matrix of parameters, and $x$ and $y$ are vectors of parameters. The basic approach is to compute the partial derivative of the function with respect to each parameter and then run gradient descent or a similar method. I've never computed the partial derivative of a function with respect to a matrix before, so I just need someone to confirm that I'm doing this correctly.

My approach:

  • First, convert the matrix multiplication into summation form:

    $$f(x, V, y) = \sum_{i=1}^{d}\sum_{j=1}^{d} x_{i} V_{i,j} y_{j}$$

  • Compute the partial derivative of $f$ with respect to $V_{i,j}$:

    $$\frac{\partial f}{\partial V_{i,j}} = x_{i} y_{j}$$

  • Compute the partial derivative of $f$ with respect to $x_{i}$:

    $$\frac{\partial f}{\partial x_{i}} = \sum_{j=1}^{d} V_{i,j} y_{j}$$ and in vectorized form this is $\frac{\partial f}{\partial x} = V y$

  • Compute the partial derivative of $f$ with respect to $y_{j}$:

    $$\frac{\partial f}{\partial y_{j}} = \sum_{i=1}^{d} x_{i} V_{i,j}$$

My questions:

  • Are my calculations correct?

  • Is it possible to express the partial derivatives in vectorized form, so that I can compute the derivatives with respect to all the entries of $V$ in one shot? I need this to be fast! (A quick numerical check of the componentwise formulas is sketched just below.)
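
One common way to verify derivative formulas like the ones above is a finite-difference comparison. Here is a minimal NumPy sketch of that check for a single entry of $V$ (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x, y = rng.normal(size=d), rng.normal(size=d)
V = rng.normal(size=(d, d))

def f(x, V, y):
    return x @ V @ y  # x^T V y

# Finite-difference estimate of df/dV[i, j] for one fixed entry.
i, j, eps = 1, 2, 1e-6
V_plus = V.copy()
V_plus[i, j] += eps
fd_estimate = (f(x, V_plus, y) - f(x, V, y)) / eps

print(fd_estimate, x[i] * y[j])  # the two values should agree closely
```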

Best Answer

Let $f \, : \, \mathbb{R}^{d} \times \mathcal{M}_{d}(\mathbb{R}) \times \mathbb{R}^{d} \, \longrightarrow \, \mathbb{R}$ be defined by:

$$ f(x,V,y) = x^{\top}Vy $$

Let $\nabla_{x}f$ (resp. $\nabla_{V}f$, resp. $\nabla_{y}f$) denote the gradient of $f$ with respect to the variable $x$ (resp. $V$, resp. $y$); in other words, the gradient of the linear map $x \, \longmapsto \, f(x,V,y)$ (resp. $V \, \longmapsto \, f(x,V,y)$, resp. $y \, \longmapsto \, f(x,V,y)$). We have:

$$ \left\{ \begin{array}{l} \nabla_{x}f \, (x,V,y) = Vy \\[2mm] \nabla_{V}f \, (x,V,y) = xy^{\top} \\[2mm] \nabla_{y}f \, (x,V,y) = V^{\top}x \\ \end{array} \right. $$
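
In code, these closed-form expressions answer the vectorization question directly: each gradient is a single matrix operation, so all entries are obtained in one shot. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def value_and_grads(x, V, y):
    """Return f(x, V, y) = x^T V y and its gradients w.r.t. x, V and y."""
    f = x @ V @ y            # scalar value
    grad_x = V @ y           # gradient w.r.t. x, a d-vector
    grad_V = np.outer(x, y)  # gradient w.r.t. V, the d x d matrix x y^T
    grad_y = V.T @ x         # gradient w.r.t. y, a d-vector
    return f, grad_x, grad_V, grad_y
```

These quantities can then be fed to gradient descent or any other first-order optimizer.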


-- Update:

Let $(E,\left\langle \cdot,\cdot \right\rangle)$ be a Euclidean space of dimension $n$, let $f \, : \, E \, \longrightarrow \, \mathbb{R}$, and let $x \in E$. Let $\mathcal{L}(E,\mathbb{R})$ denote the space of (continuous) linear forms on $E$. By definition, $f$ is differentiable at $x$ if there exists a linear form $L \in \mathcal{L}(E,\mathbb{R})$ such that:

$$ f(x+h) = f(x) + L(h) + \mathop{o} \limits_{\Vert h \Vert \to 0}\big( \Vert h \Vert \big)$$

$L$ is unique, and I will use the notation $L=\mathrm{D}_{x}f$ ($\mathrm{D}_{x}f$ is the differential of $f$ at $x$) and $L(h) = \mathrm{D}_{x}f \cdot h$. Since $\mathrm{D}_{x}f$ is in $\mathcal{L}(E,\mathbb{R})$, the Riesz representation theorem ensures that there exists a unique $x_{f} \in E$ such that $\forall h, \; \mathrm{D}_{x}f \cdot h = \left\langle h,x_{f} \right\rangle$ (one could say that the linear form $\mathrm{D}_{x}f$ is "represented" by the vector $x_{f}$ through the inner product $\left\langle \cdot,\cdot \right\rangle$). Since $x_{f}$ is unique, we call it the gradient of $f$ at $x$ and usually write $x_{f} = \nabla f(x)$.
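
For instance, take $f(x) = \left\langle x,x \right\rangle = \Vert x \Vert^{2}$. Then:

$$ f(x+h) = \left\langle x+h,x+h \right\rangle = f(x) + 2\left\langle h,x \right\rangle + \Vert h \Vert^{2} $$

so $\mathrm{D}_{x}f \cdot h = 2\left\langle h,x \right\rangle$ (the $\Vert h \Vert^{2}$ term is $\mathop{o}(\Vert h \Vert)$) and the identification above gives $\nabla f(x) = 2x$.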

If $E = \mathbb{R}^{n}$ and $\left\langle \cdot,\cdot \right\rangle$ is the usual inner product, then $\mathrm{D}_{x}f \cdot h = \left\langle h,\nabla f (x) \right\rangle = h^{\top} \nabla f (x)$.

If $E = \mathcal{M}_{d}(\mathbb{R})$ (the space of real square matrices of size $d$) and $\left\langle A,B \right\rangle = \mathrm{Tr}(A^{\top}B)$, then $\mathrm{D}_{x}f \cdot h = \left\langle h,\nabla f(x) \right\rangle = \mathrm{Tr}(h^{\top} \nabla f(x))$. (Note that in this case $\nabla f(x)$ is a matrix, since it is an element of $E$.)
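
This trace pairing is just the Frobenius inner product, i.e. the sum of entrywise products, which is easy to confirm numerically; a short NumPy sketch with arbitrary test matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))

# Tr(A^T B) equals the sum of the entrywise products of A and B.
print(np.trace(A.T @ B), np.sum(A * B))  # identical up to rounding
```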


So, how can we proceed in order to get the gradient of $\varphi \, : \, x \, \longmapsto \, x^{\top}Vy$ ($V$ and $y$ are fixed here)? Write:

$$ \begin{align*} \varphi(x+h) &= {} \big( x+h \big)^{\top}Vy \\[2mm] &= x^{\top}Vy + h^{\top}Vy \\ \end{align*} $$

In $\varphi(x+h)$, identify the $\varphi(x)$ part and the part that is linear in $h$; the latter is the differential of $\varphi$ at $x$. (The expansion here is exact, with no $\mathop{o}(\Vert h \Vert)$ remainder, because $\varphi$ is linear.)

$$ \varphi(x+h) = \underbrace{x^{\top}Vy}_{\varphi(x)} + \underbrace{\color{red}{h^{\top}Vy}}_{\mathrm{D}_{x}\varphi \, \cdot \, h} $$

Now, we know that the differential and the gradient are linked by the relation $\mathrm{D}_{x}\varphi \cdot h = h^{\top} \nabla \varphi (x)$, so identifying $h^{\top}Vy$ with $h^{\top} \nabla \varphi (x)$ gives:

$$ \nabla \varphi (x) = Vy $$

We can do the same for $\psi \, : \, V \, \longmapsto \, x^{\top}Vy$:

$$ \begin{align*} \psi(V+H) &= {} x^{\top}\big( V+H \big)y \\[3mm] &= \underbrace{x^{\top}Vy}_{\psi(V)} + \underbrace{\color{red}{x^{\top}Hy}}_{\mathrm{D}_{V} \psi \, \cdot \, H} \\ \end{align*} $$

Here, $\mathrm{D}_{V}\psi \cdot H$ is not yet of the form $\mathrm{Tr}\big(H^{\top} \nabla \psi(V)\big)$, so let's rewrite it, using that a scalar equals its own trace, that the trace is invariant under cyclic permutations, and that $\mathrm{Tr}(M) = \mathrm{Tr}(M^{\top})$:

$$ \begin{align*} x^{\top}Hy &= {} \mathrm{Tr}\big( x^{\top}Hy \big) \\[2mm] &= \mathrm{Tr}\big( yx^{\top}H \big) \\[2mm] &= \mathrm{Tr}\big( H^{\top}xy^{\top} \big) \\ \end{align*} $$

As a consequence :

$$ \nabla \psi(V) = xy^{\top}. $$
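
For completeness, the same identification applied to $\chi \, : \, y \, \longmapsto \, x^{\top}Vy$ recovers the third gradient stated at the beginning:

$$ \begin{align*} \chi(y+h) &= {} x^{\top}V\big( y+h \big) \\[3mm] &= \underbrace{x^{\top}Vy}_{\chi(y)} + \underbrace{\color{red}{x^{\top}Vh}}_{\mathrm{D}_{y}\chi \, \cdot \, h} \\ \end{align*} $$

Since the scalar $x^{\top}Vh$ equals its own transpose $h^{\top}V^{\top}x$, identifying it with $h^{\top} \nabla \chi(y)$ gives:

$$ \nabla \chi(y) = V^{\top}x. $$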