Deriving the normal equation for linear regression

Tags: linear-algebra, linear-regression, multivariable-calculus

I have been looking at different derivations of the normal equation for linear regression. The derivation that I could follow best (https://www.youtube.com/watch?v=K_EH2abOp00) unfortunately has some errors, so I tried to redo the derivation myself without them. I am stuck at a single step, marked "something something" below. Everything seems to fit except that I need a $-2X^Ty$ but only have a $-y^TX - X^Ty$. Can you help me spot the error, or tell me why the "something something" step is actually correct?

Here is my derivation:

Let $X$ be an $n \times p$ data matrix of $n$ items with $p$ features each. Let $\hat{\beta}$ be the weight vector that minimizes the residual sum of squares (RSS) of $y - \hat{y}$, where $\hat{y} = X\hat{\beta}$. Hence, to find $\hat{\beta}$, find the minimum of the RSS.

\begin{align*}
\text{RSS} & = (y - \hat{y})^T (y - \hat{y})\\
& = (y - X\hat{\beta})^T (y - X\hat{\beta})\\
& = (y^T - \hat{\beta}^TX^T) (y - X\hat{\beta})\\
& = y^Ty - y^TX\hat{\beta} - \hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta}
\end{align*}

\begin{align*}
\frac{\partial ~ \text{RSS}}{\partial \hat{\beta}} & = \frac{\partial \left( y^Ty - y^TX\hat{\beta} - \hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta} \right)}{\partial \hat{\beta}}\\
& = \frac{\partial y^Ty}{\partial\hat{\beta}} - \frac{\partial \overbrace{y^TX\hat{\beta}}^{Ax}}{\partial\hat{\beta}} - \frac{\partial \overbrace{\hat{\beta}^TX^Ty}^{x^TA}}{\partial\hat{\beta}} + \frac{\partial \overbrace{\hat{\beta}^TX^TX\hat{\beta}}^{x^TAx}}{\partial \hat{\beta}} \\
& \text{by these rules: } \frac{d~ Ax}{dx} = A, \quad \frac{d~ x^TA}{dx} = A, \quad \frac{d~ x^TAx}{dx} = 2Ax\\
& = 0 - \underbrace{y^TX}_{A} - \underbrace{X^Ty}_{A} + \underbrace{2X^T X \hat{\beta}}_{2Ax}
\end{align*}

\begin{align*}
0 & = \frac{\partial ~ \text{RSS}}{\partial \hat{\beta}}\\
& = - y^TX - X^Ty + 2X^T X \hat{\beta} \\
& \text{something something}\\
& = - 2 X^Ty + 2X^T X \hat{\beta} \\
\implies\\
& X^Ty = X^T X \hat{\beta} \\
\implies\\
& ( X^T X)^{-1} X^Ty = \hat{\beta} \\
\end{align*}
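For what it's worth, a quick numerical sanity check (with made-up random data, using numpy's `lstsq` as the reference) suggests that the final formula is right, even though I cannot justify the step above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))   # n items, p features
y = rng.normal(size=n)        # response vector

# closed-form normal-equation solution
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# reference solution from numpy's least-squares solver
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_ref))  # True
```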

Edit:

OK, so after adjusting to a consistent layout as mentioned in the accepted answer, the derivation is this:

\begin{align*}
\text{RSS} & = (y - \hat{y})^T (y - \hat{y})\\
& = (y - X\hat{\beta})^T (y - X\hat{\beta})\\
& = (y^T - \hat{\beta}^TX^T) (y - X\hat{\beta})\\
& = y^Ty - y^TX\hat{\beta} - \hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta}
\end{align*}

\begin{align*}
\frac{\partial ~ \text{RSS}}{\partial \hat{\beta}} & = \frac{\partial \left( y^Ty - y^TX\hat{\beta} - \hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta} \right)}{\partial \hat{\beta}}\\
& = \frac{\partial y^Ty}{\partial\hat{\beta}} - \frac{\partial \overbrace{y^TX\hat{\beta}}^{Ax}}{\partial\hat{\beta}} - \frac{\partial \overbrace{\hat{\beta}^TX^Ty}^{x^TA}}{\partial\hat{\beta}} + \frac{\partial \overbrace{\hat{\beta}^TX^TX\hat{\beta}}^{x^TAx}}{\partial \hat{\beta}} \\
& \text{by these rules: } \frac{d~ Ax}{dx} = A, \quad \frac{d~ x^TA}{dx} = A^T, \quad \frac{d~ x^TAx}{dx} = 2x^TA\\
& = 0 - \underbrace{y^TX}_{A} - \underbrace{(X^Ty)^T}_{A^T} + \underbrace{2 \hat{\beta}^T X^T X }_{2x^TA}\\
& = 2 \hat{\beta}^T X^T X - 2 y^TX
\end{align*}

\begin{align*}
0 & = \frac{\partial ~ \text{RSS}}{\partial \hat{\beta}}\\
& = 2 \hat{\beta}^T X^T X - 2 y^TX\\
\implies\\
& y^TX = \hat{\beta}^T X^T X \\
\implies\\
& \hat{\beta}^T = y^TX (X^T X)^{-1} \\
\implies\\
& \hat{\beta} = (X^T X)^{-1} X^Ty \\
\end{align*}
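A finite-difference check of the corrected gradient (again with made-up random data, my own little script rather than anything from the video) agrees with $2\hat{\beta}^TX^TX - 2y^TX$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def rss(b):
    r = y - X @ b
    return r @ r

# analytic gradient from the derivation above (numerator layout, a row vector)
grad_analytic = 2 * beta @ X.T @ X - 2 * y @ X

# central finite differences of the RSS, one coordinate at a time
eps = 1e-6
grad_fd = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.allclose(grad_analytic, grad_fd, atol=1e-4))  # True
```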

Best Answer

If you're following the denominator layout, then the issue might be with your first derivative rule, which should have been: $$ \frac{\partial Ax}{\partial x} = A^T $$ (see Identities)

This should turn your $y^TX$ into $X^Ty$ and you'd end up with $$ 0 = -2X^Ty + 2X^TX\hat{\beta} \implies (X^TX)^{-1} X^Ty = \hat{\beta} $$
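Spelled out with all three rules in the denominator layout (your second and third rules were already in that layout), the gradient becomes $$ \frac{\partial\,\text{RSS}}{\partial \hat{\beta}} = 0 - \underbrace{X^Ty}_{A^T} - \underbrace{X^Ty}_{A} + \underbrace{2X^TX\hat{\beta}}_{2Ax} = -2X^Ty + 2X^TX\hat{\beta}. $$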

EDIT: I've looked at the video and the author seems to be using the numerator layout, in which case the second derivative rule should have been: $$ \frac{\partial x^TA}{\partial x} = A^T $$ which turns $X^Ty$ into $y^TX$ instead.
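Either way you land on the same normal equation, since one stationarity condition is just the transpose of the other: $$ 0 = -2y^TX + 2\hat{\beta}^TX^TX \iff 0 = -2X^Ty + 2X^TX\hat{\beta} \implies X^TX\hat{\beta} = X^Ty. $$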
