Derivative of $\max(0, \mathbf{x})$ for the vector $\mathbf{x} \in \mathbb{R}^n$

matrix-calculus

I'm reading the tutorial The Matrix Calculus You Need For Deep Learning: https://arxiv.org/abs/1802.01528. On page 25, the derivative of the ReLU function $\max(0, \mathbf{x})$, where the variable $\mathbf{x}$ is a vector in $\mathbb{R}^n$, is given as follows:

[Image from the tutorial: the derivative written element-wise, as a vector whose $i$-th entry is $\frac{\partial}{\partial x_i}\max(0, x_i)$]

My question is, why is the derivative a vector instead of a diagonal matrix as follows?

\begin{align*}
\frac{\partial}{\partial \mathbf{x}}\max(0, \mathbf{x})
&= \operatorname{diag}\left(
\frac{\partial}{\partial x_1}\max(0, x_1),
\frac{\partial}{\partial x_2}\max(0, x_2),
\dotsc,
\frac{\partial}{\partial x_n}\max(0, x_n)
\right)
\end{align*}

The result of the ReLU function $\max(0, \mathbf{x})$ is a vector, and the derivative of a vector with respect to a vector variable is a Jacobian matrix. In this case the Jacobian just happens to be diagonal.
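As a quick check (a minimal sketch, assuming JAX is available; `relu` here is just my own shorthand for $\max(0, \mathbf{x})$), the Jacobian produced by automatic differentiation is indeed an $n \times n$ diagonal matrix:

```python
import jax
import jax.numpy as jnp

x = jnp.array([1.5, -2.0, 0.3, -0.7])

relu = lambda v: jnp.maximum(0.0, v)   # element-wise max(0, x)

J = jax.jacobian(relu)(x)              # full n x n Jacobian matrix
print(J)
# [[1. 0. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 0.]]

# Off-diagonal entries are zero; the diagonal is 1 exactly where x_i > 0.
assert jnp.allclose(J, jnp.diag(jnp.where(x > 0, 1.0, 0.0)))
```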

Page 7 of the same tutorial presents a general rule, shown below. I'm not sure why this rule does not apply to the derivative of the ReLU function.

[Image from the tutorial: the general rule that the derivative of a vector function with respect to a vector is the Jacobian matrix of partial derivatives]
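In other words, the rule I have in mind is the usual Jacobian definition: for $\mathbf{y} = \mathbf{f}(\mathbf{x})$ with $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$,

$$
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
$$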

Best Answer

You are correct: by definition the derivative should be a matrix. However, in this case all off-diagonal terms evaluate to zero, because the $i$-th output $\max(0, x_i)$ depends only on $x_i$. Thus, when applying the Jacobian $H$ to an arbitrary vector $v$, we have

$$Hv = \operatorname{diag}(H) \odot v = h \odot v$$

Therefore, it is often simpler and more efficient to compute only the diagonal terms $h$ and use the Hadamard (element-wise, $\odot$) product instead of the full matrix-vector product. This is probably what your reference does.
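A small numerical sketch of the identity above (again assuming JAX; the names are illustrative): the full Jacobian-vector product and the element-wise shortcut give the same result.

```python
import jax
import jax.numpy as jnp

x = jnp.array([1.5, -2.0, 0.3, -0.7])
v = jnp.array([0.1, 0.2, 0.3, 0.4])

relu = lambda u: jnp.maximum(0.0, u)

H = jax.jacobian(relu)(x)          # full (diagonal) Jacobian of max(0, x)
h = jnp.diag(H)                    # its diagonal, as a vector

print(H @ v)                       # full matrix-vector product:   [0.1 0.  0.3 0. ]
print(h * v)                       # Hadamard product with h only: [0.1 0.  0.3 0. ]
assert jnp.allclose(H @ v, h * v)  # same result, without forming the n x n matrix
```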