[Math] MSE Loss function and derivatives

derivatives, lagrange multiplier, linear algebra, machine learning, optimization

Let $x_i \in \mathbb{R}^n$, $y_i\in\mathbb{R}$, $i=1,\cdots,l$, be a training set for a linear model of the form $y = w^Tx$ for some $w\in\mathbb{R}^n$.

We have the mean squared error (MSE) as the loss function:
$$L(w) = \frac{1}{l} \sum_{i=1}^l(w^Tx_i-y_i)^2 = \frac{1}{l}||Xw-y||^2,$$
where $X = \begin{bmatrix}x_1^T\\\vdots\\ x_l^T\end{bmatrix}$.
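For concreteness, here is a minimal NumPy sketch (synthetic data, not from the original post) checking that the sum form and the matrix norm form of the loss agree:

```python
import numpy as np

rng = np.random.default_rng(0)
l, n = 50, 3                      # l samples, n features
X = rng.normal(size=(l, n))       # rows of X are the x_i^T
y = rng.normal(size=l)
w = rng.normal(size=n)

loss_sum = np.mean((X @ w - y) ** 2)            # (1/l) * sum_i (w^T x_i - y_i)^2
loss_norm = np.linalg.norm(X @ w - y) ** 2 / l  # (1/l) * ||Xw - y||^2
assert np.isclose(loss_sum, loss_norm)
```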

So, can someone explain to me why setting $L'(w) = 0$ gives $w = (X^TX)^{-1}X^Ty$?

Best Answer

All we need to do is compute the derivative of $L(w)$ and set it equal to zero.

If $f(x) = ||x||^2$, then $f'(x) = 2x^T$ (writing derivatives as row vectors, i.e., Jacobians). Since $w \mapsto Xw-y$ is affine with constant term $-y$, its derivative is the constant matrix $X$. By the chain rule we have: $$ L'(w) = \frac{2}{l}(Xw-y)^TX = \frac{2}{l}( w^TX^TX - y^TX ). $$
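As a sanity check, here is a minimal NumPy sketch (synthetic data, my own addition) comparing the analytic gradient $\frac{2}{l}X^T(Xw-y)$, i.e. $L'(w)^T$, against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
l, n = 50, 3
X = rng.normal(size=(l, n))
y = rng.normal(size=l)
w = rng.normal(size=n)

def L(w):
    # MSE loss: (1/l) * ||Xw - y||^2
    return np.linalg.norm(X @ w - y) ** 2 / l

grad_analytic = 2 / l * X.T @ (X @ w - y)

# Central finite differences, one coordinate direction at a time
eps = 1e-6
grad_numeric = np.array([
    (L(w + eps * e) - L(w - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```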

Setting this equal to zero, we have $$ \frac{2}{l}( w^TX^TX - y^TX ) = 0 \Rightarrow w^T X^TX = y^TX \Rightarrow w^T = y^TX(X^TX)^{-1},$$ where the inverse of $X^TX$ exists if and only if $\{x_1, \cdots, x_l\}$ spans $\mathbb{R}^n$ (equivalently, $X$ has full column rank $n$), which requires at least $l\geq n$; this typically holds when there is plenty of data.

Now, transposing $w^T$ (and using that $(X^TX)^{-1}$ is symmetric, since $X^TX$ is), we have $$ w = (X^TX)^{-1}X^Ty, $$ as desired.
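Finally, a minimal NumPy sketch (synthetic data, my own addition) verifying that the closed form matches a standard least-squares solver when $X$ has full column rank:

```python
import numpy as np

rng = np.random.default_rng(2)
l, n = 100, 4
X = rng.normal(size=(l, n))
y = rng.normal(size=l)

assert np.linalg.matrix_rank(X) == n            # X^T X is invertible

w_closed = np.linalg.inv(X.T @ X) @ X.T @ y     # the formula derived above
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None) # library baseline
assert np.allclose(w_closed, w_lstsq)
```

In practice one would solve the normal equations (or call a least-squares routine) rather than form the explicit inverse, which is slower and less numerically stable; the explicit formula here is for illustration.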
