Computing the gradient of a two-layer neural network w.r.t. the data

gradient-descent, neural-networks, optimization, solution-verification

I want to verify that I am computing the following gradient correctly. I am working with this conjecture, and the gradient is central to it. Using the definitions from the paper, consider the data $\vec{x} = ( x_1, x_2, \dots, x_d ) \in \mathbb{R}^d$, two vectors $\vec{a} = ( a_1, a_2, \dots, a_k ) \in \mathbb{R}^k$ and $\vec{b} = ( b_1, b_2, \dots, b_k ) \in \mathbb{R}^k$, and a matrix, which we represent as a collection of row vectors $\vec{w}_l = ( w_{l,1}, w_{l,2}, \dots, w_{l,d} ) \in \mathbb{R}^d$ for $1 \le l \le k$. We also have a Lipschitz function $\varphi$. Define a function
$$f: x \mapsto \sum_{l=1}^k a_l \varphi( \langle \vec{w}_l, \vec{x} \rangle + b_l)$$

where $\langle \vec{w}_l, \vec{x} \rangle = \sum_{i=1}^d w_{l,i} x_i$. I want to properly write out the following gradient vector $\nabla f$ as a function of $\vec{x}$.
$$\nabla f(\vec{x}) = \begin{bmatrix} \frac{\partial}{\partial x_1} f \\
\frac{\partial}{\partial x_2} f \\
\vdots \\
\frac{\partial}{\partial x_d} f
\end{bmatrix}$$

Now, for each $x_i$ I compute $\frac{\partial}{\partial x_i} f$:

$$\frac{\partial}{\partial x_i} f = \sum_{l=1}^k a_l \varphi'(\langle \vec{w}_l, \vec{x} \rangle + b_l)w_{l,i}$$

where $\varphi'$ is the derivative of $\varphi$. The conclusion is that each element of $\nabla f$ is itself a sum, every term of which depends on all of the data. This is my main doubt: do we in fact have the entire dot product $\langle \vec{w}_l, \vec{x} \rangle$ inside each term of the sum, in every element of the gradient $\nabla f$? Basically, I would just like confirmation that this computation is correct. Thanks.

Best Answer

Yes, the computation is correct.

Recall the chain rule: if $h(t) = u(g(t))$, then $$h'(t) = u'(g(t))\,g'(t).$$ (I write $u$ for the outer function to avoid a clash with your $f$.) Note that the point at which $u$ and $u'$ are evaluated does not change: both are evaluated at $g(t)$.

That is exactly what is going on in your calculation: before differentiation, $\varphi$ is evaluated at $\langle \vec{w}_l, \vec{x} \rangle + b_l$, and after differentiation, $\varphi'$ is still evaluated at that same point.
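Spelled out for the $l$-th term of your sum, the inner function is $g(\vec{x}) = \langle \vec{w}_l, \vec{x} \rangle + b_l$, with $\frac{\partial g}{\partial x_i} = w_{l,i}$, so

$$\frac{\partial}{\partial x_i}\, a_l \varphi\big( \langle \vec{w}_l, \vec{x} \rangle + b_l \big) = a_l \varphi'\big( \langle \vec{w}_l, \vec{x} \rangle + b_l \big)\, w_{l,i}.$$

Summing over $l$ gives exactly your expression. So yes: the entire dot product $\langle \vec{w}_l, \vec{x} \rangle$ sits inside $\varphi'$ in every term, and consequently each component of $\nabla f$ depends on all of $\vec{x}$.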

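If you want an independent sanity check, you can compare the claimed formula against central finite differences. Below is a minimal sketch in NumPy; the names and the concrete choice $\varphi = \tanh$ are mine, not from the paper. It uses the vectorized form of your formula: stacking the $\vec{w}_l$ as the rows of a matrix $W \in \mathbb{R}^{k \times d}$, the gradient is $\nabla f(\vec{x}) = W^{\top}\big( \vec{a} \odot \varphi'(W\vec{x} + \vec{b}) \big)$, where $\odot$ is the elementwise product and $\varphi'$ is applied elementwise.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3

# A random instance of the data point and the parameters.
x = rng.normal(size=d)        # data x in R^d
a = rng.normal(size=k)        # output weights a_l
b = rng.normal(size=k)        # biases b_l
W = rng.normal(size=(k, d))   # row l is the vector w_l

phi = np.tanh                 # a concrete Lipschitz choice of phi

def dphi(t):
    # Derivative of tanh: phi'(t) = 1 - tanh(t)^2.
    return 1.0 - np.tanh(t) ** 2

def f(x):
    # f(x) = sum_l a_l * phi(<w_l, x> + b_l)
    return np.sum(a * phi(W @ x + b))

# Claimed gradient: component i is sum_l a_l * phi'(<w_l, x> + b_l) * w_{l,i},
# i.e. W^T (a * phi'(W x + b)) in vectorized form.
grad_claimed = W.T @ (a * dphi(W @ x + b))

# Independent check: central finite differences in each coordinate direction.
eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(d)])

print(np.max(np.abs(grad_claimed - grad_numeric)))  # ~1e-10, so the two agree
```

An error on the order of $10^{-10}$ is what one expects from central differences with this step size, so the formula checks out.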