How to simplify the calculation of this gradient

Tags: least-squares, multivariable-calculus, vector-analysis

Goal: Given $f(\alpha) :=\dfrac{1}{2} \|y-X\alpha\|_2^2$, I want to show that $\nabla f(\alpha)=0\iff X^T X\alpha = X^T y$, where $X\in\mathbb{R}^{n\times p}$, $\alpha\in\mathbb{R}^p$, $y\in\mathbb{R}^n$.
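Before the calculus, here is a minimal numerical sanity check of the claimed equivalence (a numpy sketch; the sizes $n=6$, $p=3$, the seed, and the random data are arbitrary): the least-squares minimizer returned by `np.linalg.lstsq` does satisfy the normal equations.

```python
import numpy as np

# Sanity check: the minimizer of f(alpha) = 0.5 * ||y - X alpha||^2
# satisfies the normal equations X^T X alpha = X^T y.
rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

alpha, *_ = np.linalg.lstsq(X, y, rcond=None)   # argmin over a of ||y - X a||_2

print(np.allclose(X.T @ X @ alpha, X.T @ y))    # True
```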


If I compute the gradient directly from the definition, I have no problem showing this. But in the lecture the professor uses some gradient/vector properties as shortcuts, and I want to be able to recognise and use them as well. However, when I use the shortcut my solution is off by a transposition, and I want to know why it fails. Here are the steps:
$$ \nabla f(\alpha) = \frac{1}{2}\frac{d}{d\alpha}\Big(\|y\|_2^2-2y^T X\alpha+\alpha^TX^TX\alpha\Big) $$
I know that $\dfrac{d}{dx}\, x^TAx = (A+A^T)x$, so $\dfrac{d}{d\alpha}\, \alpha^TX^TX\alpha = 2X^TX\alpha$ since $X^TX$ is symmetric (a quick numerical check of this identity is sketched below). Also, $\dfrac{d}{d\alpha}\, \|y\|_2^2= 0$ is immediate. The error lies in the last derivative I have to compute.
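A central-difference sketch of that quadratic-form identity (the size $p=4$, the seed, and the random test matrices are arbitrary):

```python
import numpy as np

# Check d/dx (x^T A x) = (A + A^T) x for a generic (non-symmetric) A,
# and the symmetric case S = X^T X, where the derivative reduces to 2 S x.
rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
x = rng.standard_normal(p)
eps = 1e-6

def num_grad(q):
    # Central-difference gradient of the scalar function q at the point x.
    return np.array([(q(x + eps * e) - q(x - eps * e)) / (2 * eps)
                     for e in np.eye(p)])

print(np.allclose(num_grad(lambda v: v @ A @ v), (A + A.T) @ x))  # True

X = rng.standard_normal((6, p))
S = X.T @ X                                   # symmetric, so (S + S^T) x = 2 S x
print(np.allclose(num_grad(lambda v: v @ S @ v), 2 * S @ x))      # True
```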

Here's what I've done: since for $x\in\mathbb{R}^p$ we have $\dfrac{d}{dx} (x) = E\in\mathbb{R}^{p\times p}$, the identity matrix, it follows that $\dfrac{d}{d\alpha}\, 2y^TX\alpha = 2y^TX\cdot E = 2y^TX$. This leads to the result:
$$ \nabla f(\alpha)=0\iff X^T X\alpha = y^T X$$
This is wrong by a transposition: the dimensions don't match, so it doesn't really make sense. So the question is: why doesn't my shortcut work in this case, and in what other easy way can I make this work?
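A small numpy sketch of where the shortcut lands (sizes, seed, and random data are arbitrary): the finite-difference gradient of $f$ matches the column vector $X^TX\alpha - X^Ty$, while the shortcut produces the $1\times p$ row vector $\alpha^TX^TX - y^TX$, which is exactly its transpose.

```python
import numpy as np

# Compare the finite-difference gradient of f(alpha) = 0.5 * ||y - X alpha||^2
# with the column vector X^T X alpha - X^T y and with the shortcut's
# row vector alpha^T X^T X - y^T X.
rng = np.random.default_rng(2)
n, p = 5, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal((n, 1))          # column vector, shape (n, 1)
alpha = rng.standard_normal((p, 1))      # column vector, shape (p, 1)

f = lambda a: 0.5 * np.sum((y - X @ a) ** 2)

eps = 1e-6
num_grad = np.zeros((p, 1))
for k in range(p):
    e = np.zeros((p, 1)); e[k] = eps
    num_grad[k] = (f(alpha + e) - f(alpha - e)) / (2 * eps)

correct  = X.T @ X @ alpha - X.T @ y     # shape (p, 1): matches the gradient
shortcut = alpha.T @ X.T @ X - y.T @ X   # shape (1, p): what the shortcut gives

print(np.allclose(num_grad, correct))    # True
print(shortcut.shape, correct.shape)     # (1, 3) (3, 1)
print(np.allclose(shortcut.T, correct))  # True: off by exactly a transpose
```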

Note: The same problem arises when I consider $f:\mathbb{R}^n\to \mathbb{R}$, $f(x) = (1,\dots,1)\cdot \begin{pmatrix}x_1\\ \vdots\\ x_n\end{pmatrix}= x_1+\cdots +x_n$. Obviously $\nabla f(x) = \begin{pmatrix} 1\\ \vdots \\ 1\end{pmatrix}\in\mathbb{R}^n$. But using the same shortcut as above we have:
$$ \frac{d}{dx} \big(\;(1,\dots,1)\cdot \begin{pmatrix}x_1\\ \vdots\\ x_n\end{pmatrix}\;\big)=(1,\dots,1)\cdot E_{n\times n} = (1,\dots,1)\in\mathbb{R}^{1\times n} $$
Again, it is nearly the right answer, but not quite.
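The same shape mismatch can be seen numerically (a sketch with $n=4$, chosen arbitrarily):

```python
import numpy as np

# For f(x) = x_1 + ... + x_n the shortcut gives the 1 x n row (1, ..., 1);
# the gradient is its transpose, a column of ones.
n = 4
x = np.arange(1.0, n + 1).reshape(n, 1)

f = lambda v: np.sum(v)
eps = 1e-6
num_grad = np.zeros((n, 1))
for k in range(n):
    e = np.zeros((n, 1)); e[k] = eps
    num_grad[k] = (f(x + e) - f(x - e)) / (2 * eps)

row = np.ones((1, n))                     # what the shortcut produces
print(np.allclose(num_grad, row.T))       # True: the gradient is the transpose
```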

Best Answer

$\def\p#1#2{\frac{\partial #1}{\partial #2}}$ Explicit use of the inner (dot) product is one way to avoid the transposition error.

Define the vector $w = X\cdot a - y$ (writing $a$ for the question's $\alpha$), then write the function in terms of this vector and calculate its gradient. $$\eqalign{ f &= \tfrac 12w\cdot w \\ df &= w\cdot dw \;=\; w\cdot(X\cdot da) \;=\; (X^T\!\cdot w)\cdot da \\ \p{f}{a} &= X^T\!\cdot w \;=\; X^T(X\cdot a-y) \\ }$$ Taking the transpose of a matrix is never an issue, but any time you find yourself taking the transpose of a vector, proceed with caution. Rewriting the equation using dot products is one foolproof method: it forces you to arrange the expression on each side of a dot product into dimensionally compatible forms.
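As a numerical sanity check on the differential identity above (a numpy sketch; the sizes, the seed, and the small step $da$ are arbitrary): for a small step $da$, the change $f(a+da)-f(a)$ should agree with $(X^T\! w)\cdot da$ to first order.

```python
import numpy as np

# With w = X a - y, the first-order change f(a + da) - f(a) should match
# (X^T w) . da for a small step da.
rng = np.random.default_rng(3)
n, p = 7, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
a = rng.standard_normal(p)

f = lambda v: 0.5 * np.sum((X @ v - y) ** 2)

w = X @ a - y
grad = X.T @ w                            # the answer's result, X^T (X a - y)

da = 1e-6 * rng.standard_normal(p)
print(np.isclose(f(a + da) - f(a), grad @ da, rtol=1e-4))   # True
```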

Another foolproof method is to use dyadic products; very complicated expressions often require index notation for a solution.
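For the function in this question, the index-notation route is short enough to sketch (writing all sums explicitly), and it recovers the same gradient: $$\eqalign{ f &= \tfrac12\sum_i\Big(y_i-\sum_j X_{ij}\alpha_j\Big)^2 \\ \frac{\partial f}{\partial \alpha_k} &= -\sum_i \Big(y_i-\sum_j X_{ij}\alpha_j\Big)X_{ik} \;=\; \Big[X^T(X\alpha-y)\Big]_k \\ }$$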