[Math] Matrix calculus in multiple linear regression OLS estimate derivation

derivatives, least-squares, linear-algebra, matrices, matrix-calculus

The steps of the following derivation are from here.

Starting from $y= Xb +\epsilon $, which really is just the same as

$\begin{bmatrix}
y_{1} \\
y_{2} \\
\vdots \\
y_{N}
\end{bmatrix}
=
\begin{bmatrix}
1 & x_{21} & \cdots & x_{K1} \\
1 & x_{22} & \cdots & x_{K2} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{2N} & \cdots & x_{KN}
\end{bmatrix}
*
\begin{bmatrix}
b_{1} \\
b_{2} \\
\vdots \\
b_{K}
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_{1} \\
\epsilon_{2} \\
\vdots \\
\epsilon_{N}
\end{bmatrix} $

it all comes down to minimizing $e'e$, where $e = y - Xb$ is the vector of residuals:

$e'e = \begin{bmatrix}
e_{1} & e_{2} & \cdots & e_{N} \\
\end{bmatrix}
\begin{bmatrix}
e_{1} \\
e_{2} \\
\vdots \\
e_{N}
\end{bmatrix} = \sum_{i=1}^{N}e_{i}^{2}
$

So minimizing $e'e$ gives us:

$\min_{b}$ $e'e = (y-Xb)'(y-Xb)$

$\min_{b}$ $e'e = y'y - 2b'X'y + b'X'Xb$

(*) $\frac{\partial(e'e)}{\partial b} = -2X'y + 2X'Xb \stackrel{!}{=} 0$

$X'Xb=X'y$

$b=(X'X)^{-1}X'y$
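
As a quick sanity check, the closed form can be compared against a generic least-squares solver on random data. This is only a minimal sketch in NumPy; the data and names (`b_hat`, `b_true`, ...) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 4                                    # N observations, K coefficients (incl. intercept)

# Design matrix with a leading column of ones, as in the expanded form above
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
b_true = rng.normal(size=K)
y = X @ b_true + 0.1 * rng.normal(size=N)        # y = Xb + eps

# Normal equations X'Xb = X'y, i.e. b = (X'X)^{-1} X'y
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with a generic least-squares solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b_hat, b_lstsq))               # True
```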

I'm pretty new to matrix calculus, so I was a bit confused about (*).

In step (*), $\frac{\partial(y'y)}{\partial b} = 0$, which makes sense. And then $\frac{\partial(-2b'X'y)}{\partial b} = -2X'y$, but why exactly is this true? If it were $\frac{\partial(-2b'X'y)}{\partial b'}$, then that would make perfect sense to me. Is taking the partial derivative with respect to $b$ the same as taking the partial derivative with respect to $b'$?

Similarly, $\frac{\partial(b'X'Xb)}{\partial b} = 2X'Xb$. Why is this true? Shouldn't it be $= 2b'X'X$?
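
For what it's worth, both identities do check out numerically before worrying about the layout question. A rough finite-difference sketch (random data; the helper `num_grad` is made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 3
X = rng.normal(size=(N, K))
y = rng.normal(size=N)
b = rng.normal(size=K)

def num_grad(f, b, h=1e-6):
    """Central-difference approximation to the gradient of a scalar function f at b."""
    g = np.zeros_like(b)
    for i in range(len(b)):
        step = np.zeros_like(b)
        step[i] = h
        g[i] = (f(b + step) - f(b - step)) / (2 * h)
    return g

# d(-2 b'X'y)/db should equal -2 X'y
f1 = lambda b: -2 * b @ X.T @ y
print(np.allclose(num_grad(f1, b), -2 * X.T @ y))        # True

# d(b'X'Xb)/db should equal 2 X'Xb
f2 = lambda b: b @ X.T @ X @ b
print(np.allclose(num_grad(f2, b), 2 * X.T @ X @ b))     # True
```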

Best Answer

Consider the full matrix case of the regression $$\eqalign{ Y &= XB+E \cr E &= Y-XB \cr }$$ In this case the function to be minimized is $$\eqalign{f &= \|E\|^2_F = E:E}$$ where the colon denotes the Frobenius inner product.

Now find the differential and gradient $$\eqalign{ df &= 2\,E:dE \cr &= -2\,E:X\,dB \cr &= 2\,(XB-Y):X\,dB \cr &= 2\,X^T(XB-Y):dB \cr\cr \frac{\partial f}{\partial B} &= 2\,X^T(XB-Y) \cr }$$ Set the gradient to zero and solve $$\eqalign{ X^TXB &= X^TY \cr B &= (X^TX)^{-1}X^TY \cr }$$ This result remains valid when $B$ is a $(K\times 1)$ matrix, i.e. a vector.
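
This, too, can be checked numerically; the sketch below (arbitrary random shapes, nothing from the answer itself) verifies that the gradient vanishes at the closed-form $B$ and that the result matches a generic solver.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 80, 5, 3                   # N observations, K predictors, M responses
X = rng.normal(size=(N, K))
Y = rng.normal(size=(N, M))

# Closed form from the answer: B = (X'X)^{-1} X'Y
B = np.linalg.solve(X.T @ X, X.T @ Y)

# The gradient 2 X'(XB - Y) of ||Y - XB||_F^2 should vanish at the minimizer
grad = 2 * X.T @ (X @ B - Y)
print(np.allclose(grad, 0, atol=1e-9))           # True (up to round-off)

# Column-by-column agreement with a generic least-squares solve
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(B, B_lstsq))                   # True
```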

The problem is that, in the vector case, people tend to write the function in terms of the transpose product instead of the inner product, and then fall into rabbit holes concerning the details of the transpositions.
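
For completeness, here is the same differential trick written in the vector notation of the question (a sketch only, with $e = y - Xb$ as above): $$\eqalign{ f &= e'e \cr df &= 2\,e'\,de = -2\,e'X\,db \cr \frac{\partial f}{\partial b} &= -2\,X'e = -2\,X'y + 2\,X'Xb \cr }$$ which reproduces step (*) without ever having to decide whether the $b'$ or the $b$ is the one being differentiated.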