[Math] Derivation of SSE gradient for Linear Regression

linear algebra, matrices, regression

My textbook gives the derivation of the gradient for the likelihood of a linear regression model (minimizing the negative log likelihood by minimizing the sum of squared errors). The first line looks like this:

$$NLL(\mathbf{w}) = \frac{1}{2}(\mathbf{y} - \mathbf{Xw})^T(\mathbf{y} - \mathbf{Xw}) = \frac{1}{2}\mathbf{w}^T(\mathbf{X}^T\mathbf{X})\mathbf{w} - \mathbf{w}^T(\mathbf{X}^T\mathbf{y}) $$

$\mathbf{X}$ is the data matrix,
$\mathbf{y}$ is the target output for each datapoint, and
$\mathbf{w}$ is the vector of regression weights.

I'm not sure how they got from the first representation to the second. When I expand the first term by distributing the transpose and the multiplication, I get something different, shown below.

$$ \frac{1}{2}(\mathbf{y}^T\mathbf{y} - \mathbf{w}^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{Xw} + \mathbf{w}^T\mathbf{X}^T\mathbf{Xw}) $$

Can someone fill in the steps for me?
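
For what it's worth, here is a quick numerical check I put together (just my own sketch using NumPy; the variable names are mine) showing that my expansion agrees with the first form, but that the book's second expression evaluates to something different:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))   # data matrix
y = rng.normal(size=n)        # target outputs
w = rng.normal(size=d)        # regression weights

r = y - X @ w
original = 0.5 * (r @ r)                                            # (1/2)(y - Xw)^T (y - Xw)
expanded = 0.5 * (y @ y - w @ X.T @ y - y @ X @ w + w @ X.T @ X @ w)  # my expansion
book_rhs = 0.5 * (w @ X.T @ X @ w) - w @ X.T @ y                    # the book's second expression

print(np.isclose(original, expanded))   # True: the expansion matches the first form
print(np.isclose(original, book_rhs))   # False: the book's form gives a different value
```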

Best Answer

Firstly, note that the second term in your brackets is $w^TX^Ty$ (the $y$ is not transposed); that is the form we will work with below.

Next, I think there are two things going on here:

  1. Since the gradient will be taken with respect to $w$, the term $y^Ty$ has been dropped: it does not depend on $w$, so it contributes nothing to the gradient.
  2. Since $w^TX^Ty$ is a scalar, $(w^TX^Ty)=(w^TX^Ty)^T=y^TXw$, so the two middle terms are equal and combine into $-2w^TX^Ty$.

So, ignoring the $y^Ty$ term, your expression in the brackets is

$\frac{1}{2}(-2w^TX^Ty+w^TX^TXw)=\frac{1}{2}w^TX^TXw-w^TX^Ty$, which is the expression you need.
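
As a quick numerical sanity check (just a sketch in NumPy, not from the book), you can confirm that the full expression and the book's expression differ only by the dropped constant $\frac{1}{2}y^Ty$, and that both therefore have the same gradient $X^TXw - X^Ty$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

r = y - X @ w
nll_full = 0.5 * (r @ r)                              # (1/2)(y - Xw)^T (y - Xw)
nll_book = 0.5 * (w @ X.T @ X @ w) - w @ X.T @ y      # with the constant (1/2) y^T y dropped

# The two differ exactly by the constant (1/2) y^T y, which does not depend on w
print(np.isclose(nll_full - nll_book, 0.5 * (y @ y)))   # True

# Both therefore have the same gradient with respect to w: X^T X w - X^T y
grad = X.T @ X @ w - X.T @ y
eps = 1e-6
num_grad = np.array([
    (0.5 * ((y - X @ (w + eps * e)) @ (y - X @ (w + eps * e))) - nll_full) / eps
    for e in np.eye(d)
])
print(np.allclose(grad, num_grad, atol=1e-4))            # True: matches a finite-difference estimate
```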