[Math] How to apply gradient with respect to a vector

derivatives, linear algebra, linear regression, multivariable-calculus, vector analysis

In Deep Learning (adapted from page 108), where linear regression is explained as a machine learning algorithm, there is a passage deriving the solution of this expression:

To minimize $MSE$, we can simply solve for where its gradient is $0$:
$$\nabla_{\mathbf w}MSE = 0$$

Here, $\hat{\mathbf{y}}$ is the prediction of the linear regression, defined as $\mathbf X \mathbf w$, where $\mathbf{X}$ is the matrix of inputs and $\mathbf{w}$ is the weight vector, while $\mathbf{y}$ is the vector of observed output values.

The solution then proceeds as follows:

$$\nabla_{\mathbf w}MSE = 0$$
$$\Rightarrow \nabla_{\mathbf w}\frac{1}{m}\lvert\lvert \hat {\mathbf{y}}-{\mathbf{y}}\rvert\rvert_2^2= 0$$
$$\Rightarrow \frac{1}{m} \nabla_{\mathbf w}\lvert\lvert {\mathbf{X}}\mathbf w -{\mathbf{y}}\rvert\rvert_2^2= 0$$
$$\Rightarrow \nabla_{\mathbf w} ( {\mathbf{X}}\mathbf w -{\mathbf{y}} )^{T} ( {\mathbf{X}}\mathbf w -{\mathbf{y}} ) = 0$$
$$\Rightarrow \nabla_{\mathbf w} ( \mathbf{w}^T{\mathbf{X}}^{T}{\mathbf{X}}\mathbf w - 2\mathbf{w}^T{\mathbf{X}}^{T}{\mathbf{y}} + {\mathbf{y}}^{T}{\mathbf{y}} ) = 0$$

Now, the subsequent step is:

$$\Rightarrow ( 2{\mathbf{X}}^{T}{\mathbf{X}}\mathbf w - 2{\mathbf{X}}^{T}{\mathbf{y}}) = 0$$

I think I understand that this step takes the derivative with respect to the vector $\mathbf{w}$; however, I could not find the exact name of this kind of derivative, and consequently the rules needed to carry it out myself (in particular, how to deal with transposed vectors and matrices).
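For what it is worth, the quoted result can at least be checked numerically before worrying about the symbolic rules. Below is a minimal NumPy sketch with made-up random data (the names `X`, `y`, `w`, and `sq_norm` are only for illustration) that compares $2\mathbf{X}^{T}(\mathbf{X}\mathbf w - \mathbf{y})$ against a central finite-difference approximation of the gradient of $\lvert\lvert \mathbf{X}\mathbf w - \mathbf{y} \rvert\rvert_2^2$:

```python
import numpy as np

# Compare the claimed gradient 2 * X^T (X w - y) of ||Xw - y||_2^2
# with a central finite-difference approximation, on made-up data.
rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
w = rng.normal(size=n)

def sq_norm(w):
    r = X @ w - y
    return r @ r                      # ||Xw - y||_2^2

claimed = 2 * X.T @ (X @ w - y)       # gradient from the book's derivation

eps = 1e-6
numerical = np.array([
    (sq_norm(w + eps * e) - sq_norm(w - eps * e)) / (2 * eps)
    for e in np.eye(n)                # perturb one coordinate of w at a time
])

print(np.allclose(claimed, numerical))  # prints True
```

So the step is correct; what I am missing is how to derive it symbolically.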

Best Answer

Using matrix transpose notation with vectors often confuses me. So I prefer to expand the norm using an explicit dot product instead:
$$\|z\|^2_2 = z\cdot z$$
In this form, finding the differential and the gradient of the norm is straightforward:
$$d\|z\|^2_2 = 2z\cdot dz, \qquad \frac{\partial\|z\|^2_2}{\partial z} = 2z$$
Now repeat the calculation for $\,z=(X\cdot w-y)$:
$$\begin{aligned} d\|z\|^2_2 &= 2z\cdot dz \\ &= 2z\cdot (X\cdot dw) \\ &= 2(X^T\cdot z)\cdot dw \end{aligned}$$
$$\begin{aligned} \frac{\partial\|z\|^2_2}{\partial w} &= 2X^T\cdot z \\ &= 2X^T\cdot(X\cdot w-y) \end{aligned}$$
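Equivalently, the field the question is looking for is usually called matrix calculus (or vector differentiation), and the same step can be carried out term by term on the expanded expression using two standard identities, valid for symmetric $A$:
$$\nabla_{\mathbf w}\left(\mathbf w^T A \mathbf w\right) = 2A\mathbf w, \qquad \nabla_{\mathbf w}\left(\mathbf b^T \mathbf w\right) = \mathbf b$$
Taking $A = \mathbf X^T\mathbf X$ (which is symmetric) and $\mathbf b = -2\mathbf X^T\mathbf y$, and noting that $\mathbf y^T\mathbf y$ does not depend on $\mathbf w$, gives
$$\nabla_{\mathbf w}\left(\mathbf w^T\mathbf X^T\mathbf X\mathbf w - 2\mathbf w^T\mathbf X^T\mathbf y + \mathbf y^T\mathbf y\right) = 2\mathbf X^T\mathbf X\mathbf w - 2\mathbf X^T\mathbf y,$$
which is exactly the step quoted in the question.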
