My textbook gives the derivation of the gradient for the likelihood of a linear regression model (minimizing the negative log likelihood by minimizing the sum of squared errors). The first line looks like this:
$$NLL(\mathbf{w}) = \frac{1}{2}(\mathbf{y} - \mathbf{Xw})^T(\mathbf{y} - \mathbf{Xw}) = \frac{1}{2}\mathbf{w}^T(\mathbf{X}^T\mathbf{X})\mathbf{w} - \mathbf{w}^T(\mathbf{X}^T\mathbf{y}) $$
$\mathbf{X}$ is the data matrix,
$\mathbf{y}$ is the target output for each datapoint, and
$\mathbf{w}$ is the vector of regression weights.
I'm not sure how they got from the first representation to the second. When I expand the first term by distributing the transpose and the multiplication, I get something different, shown below.
$$ \frac{1}{2}(\mathbf{y}^T\mathbf{y} - \mathbf{w}^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{Xw} + \mathbf{w}^T\mathbf{X}^T\mathbf{Xw}) $$
Can someone fill in the steps for me?
Best Answer
Firstly, note that your expansion is correct; the textbook's right-hand side just differs from it by a constant. Two things are going on here:

1. The $y^Ty$ term does not involve $w$, so it is an additive constant that can be dropped when minimizing over $w$.
2. The two cross terms are scalars, and a scalar equals its own transpose, so $y^TXw = (y^TXw)^T = w^TX^Ty$. Together they combine into $-2w^TX^Ty$.

So, ignoring the $y^Ty$ term, your expression in the brackets is
$\frac{1}{2}(-2w^TX^Ty+w^TX^TXw)=\frac{1}{2}w^TX^TXw-w^TX^Ty$, which is the expression you need.
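If it helps, you can check numerically that the full expansion and the textbook's expression differ only by the constant $\frac{1}{2}y^Ty$, and that both give the same gradient in $w$. A minimal sketch with NumPy (the dimensions and random data are arbitrary, chosen just for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))   # data matrix (arbitrary size for the check)
y = rng.standard_normal(20)        # target vector
w = rng.standard_normal(3)         # arbitrary weight vector

# Direct form: (1/2)(y - Xw)^T (y - Xw)
r = y - X @ w
nll_direct = 0.5 * r @ r

# Expanded form: (1/2) w^T X^T X w - w^T X^T y, plus the constant (1/2) y^T y
nll_expanded = 0.5 * w @ (X.T @ X) @ w - w @ (X.T @ y) + 0.5 * y @ y

assert np.isclose(nll_direct, nll_expanded)

# The dropped constant has no effect on the gradient with respect to w:
grad = X.T @ X @ w - X.T @ y
assert np.allclose(grad, X.T @ (X @ w - y))
```

Since $\frac{1}{2}y^Ty$ does not depend on $w$, its gradient is zero, which is exactly why the textbook is free to omit it before differentiating.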