Least Squares – Residual Sum of Squares in Closed Form

least squares, matrices

In deriving the Residual Sum of Squares (RSS) we have:

\begin{equation}
\hat{Y} = X^T\hat{\beta}
\end{equation}

where the parameter $\hat{\beta}$ is used to estimate the output $\hat{Y}$ for an input vector $X^T$.

\begin{equation}
RSS(\beta) = \sum_{i=1}^n (y_i - x_i^T\beta)^2
\end{equation}

which in matrix form would be

\begin{equation}
RSS(\beta) = (y - X \beta)^T (y - X \beta)
\end{equation}

Differentiating w.r.t. $\beta$ and setting the derivative to zero, we get

\begin{equation}
X^T(y - X\beta) = 0
\end{equation}

My question is: how is the last step done? How does differentiating yield the last equation?

Best Answer

According to Randal J. Barnes, Matrix Differentiation, Prop. 7, if $\alpha=y^TAx$ where $y$ and $x$ are vectors and $A$ is a matrix, we have $$\frac{\partial\alpha}{\partial x}=y^TA\quad\text{and}\quad\frac{\partial\alpha}{\partial y}=x^TA^T$$ (the proof is very simple). Also, according to his Prop. 8, if $\alpha=x^TAx$ then $$\frac{\partial \alpha}{\partial x}=x^T(A+A^T).$$

Expanding $\mathrm{RSS}(\beta)=(y-X\beta)^T(y-X\beta)=y^Ty-y^TX\beta-\beta^TX^Ty+\beta^TX^TX\beta$ and applying these propositions term by term, I would rather write, in place of Alecos's solution above, $$\frac{\partial\mathrm{RSS}(\beta)}{\partial\beta}=-y^TX-y^TX+\beta^T\bigl(X^TX+(X^TX)^T\bigr),$$ where the last term is indeed $2\beta^TX^TX$ since $X^TX$ is symmetric, i.e. $(X^TX)^T=X^TX$. Setting the derivative to zero gives the equation $$(\beta^TX^T-y^T)X=0,$$ which provides the same result as in Alecos's answer once we take the transpose of both sides, namely $X^T(y-X\beta)=0$. I guess Alecos has used a different definition of matrix differentiation than Barnes, but the final result is, of course, correct.
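Putting the pieces together (and assuming $X^TX$ is invertible, i.e. $X$ has full column rank, which the original question does not state), the chain of steps is:

\begin{equation}
\frac{\partial\mathrm{RSS}(\beta)}{\partial\beta} = -2y^TX + 2\beta^TX^TX = 0
\;\Longrightarrow\;
X^T(y - X\beta) = 0
\;\Longrightarrow\;
\hat{\beta} = (X^TX)^{-1}X^Ty.
\end{equation}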
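If it helps to see this numerically, here is a minimal NumPy sketch (not part of the original answer) that solves the normal equations on a small simulated problem and checks that $X^T(y - X\hat{\beta})$ vanishes at the minimizer:

```python
import numpy as np

# Build a small random regression problem; the design matrix X, true
# coefficients beta_true, and noise level are arbitrary choices for illustration.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T y (assumes full column rank).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The first-order condition X^T (y - X beta_hat) should be zero at the minimizer.
gradient_condition = X.T @ (y - X @ beta_hat)
print(beta_hat)
print(gradient_condition)  # ~ zero up to floating-point error
assert np.allclose(gradient_condition, 0, atol=1e-10)

# Cross-check against NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```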
