Solved – How to derive the gradient for weighted least squares

least-squares, machine-learning, matrix-calculus, regression

So in my previous "adventures in statsland" episode, I believe I was able to convert the weighted sum of squares cost function into matrix form (Formula $\ref{cost}$).
$$
J(w) = (Xw - y)^T U (Xw - y) \tag{1}\label{cost}
$$

Here $X$ is an $m \times n$ input matrix, $w$ is an $n \times 1$ column vector of weights (the model coefficients), $y$ is an $m \times 1$ vector of outputs, and $U$ is an $m \times m$ diagonal matrix whose diagonal element $u_{ii}$ weights the $i$-th observation.
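As a sanity check, here is a minimal numpy sketch (with randomly generated data, purely illustrative) confirming that the matrix form above matches the plain weighted sum of squared residuals $\sum_i u_{ii} (x_i^T w - y_i)^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
X = rng.normal(size=(m, n))        # m x n input matrix
w = rng.normal(size=(n, 1))        # n x 1 coefficient vector
y = rng.normal(size=(m, 1))        # m x 1 output vector
u = rng.uniform(0.1, 2.0, size=m)  # positive per-observation weights
U = np.diag(u)                     # m x m diagonal weighting matrix

# Matrix form: J(w) = (Xw - y)^T U (Xw - y)
r = X @ w - y
J_matrix = float(r.T @ U @ r)

# Scalar form: sum_i u_i * (x_i . w - y_i)^2
J_sum = float(np.sum(u * r.ravel() ** 2))

print(J_matrix, J_sum)  # the two should agree up to floating-point error
```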

Now I am trying to get the gradient of this function with respect to $w$. I've followed the technique outlined in this blog post, which derives the gradient for ordinary least squares. However, since the residual $Xw - y$ is multiplied by $U$ in our case (the weighting part), the matrices become unwieldy.

Although I'm missing the gradient, I know that the weighted least squares estimate of $w$ is:
$$
\hat{w} = (X^T U X)^{-1} X^T U y
$$

Does anyone know how to derive the gradient or point me to somewhere? I've been searching for hours. Thanks!

Best Answer

$$J (\mathrm w) = (\mathrm X \mathrm w - \mathrm y)^T \mathrm U (\mathrm X \mathrm w - \mathrm y) = \mathrm w^T \mathrm X^T \mathrm U \mathrm X \mathrm w - \mathrm w^T \mathrm X^T \mathrm U \mathrm y - \mathrm y^T \mathrm U \mathrm X \mathrm w + \mathrm y^T \mathrm U \mathrm y = \mathrm w^T \mathrm X^T \mathrm U \mathrm X \mathrm w - 2 \mathrm w^T \mathrm X^T \mathrm U \mathrm y + \mathrm y^T \mathrm U \mathrm y$$

since $\mathrm U$ is diagonal (hence symmetric), the scalar $\mathrm y^T \mathrm U \mathrm X \mathrm w$ equals its transpose $\mathrm w^T \mathrm X^T \mathrm U \mathrm y$, which is what lets us combine the two cross terms.

Taking the gradient with respect to $\mathrm w$ (using $\nabla_{\mathrm w} (\mathrm w^T \mathrm A \mathrm w) = 2 \mathrm A \mathrm w$ for symmetric $\mathrm A$ and $\nabla_{\mathrm w} (\mathrm w^T \mathrm b) = \mathrm b$),

$$\nabla_{\mathrm w} J (\mathrm w) = 2 \mathrm X^T \mathrm U \mathrm X \mathrm w - 2 \mathrm X^T \mathrm U \mathrm y = 2 \mathrm X^T \mathrm U (\mathrm X \mathrm w - \mathrm y)$$
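A quick way to convince yourself of this formula is a finite-difference check. Here is a minimal numpy sketch with randomly generated data (array names are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 4
X = rng.normal(size=(m, n))
y = rng.normal(size=(m, 1))
U = np.diag(rng.uniform(0.1, 2.0, size=m))
w = rng.normal(size=(n, 1))

def J(w):
    """Weighted least squares cost J(w) = (Xw - y)^T U (Xw - y)."""
    r = X @ w - y
    return float(r.T @ U @ r)

# Analytic gradient: 2 X^T U (Xw - y)
grad_analytic = 2 * X.T @ U @ (X @ w - y)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(n):
    e = np.zeros_like(w)
    e[i] = eps
    grad_numeric[i] = (J(w + e) - J(w - e)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # should be tiny
```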

which vanishes at the solution to the linear system

$$\mathrm X^T \mathrm U \mathrm X \mathrm w = \mathrm X^T \mathrm U \mathrm y$$

If $\mathrm X$ has full column rank and the diagonal entries of $\mathrm U$ are strictly positive, then $\mathrm X^T \mathrm U \mathrm X$ is positive definite (hence invertible) and the unique solution is

$$\hat{\mathrm w} = (\mathrm X^T \mathrm U \mathrm X)^{-1} \mathrm X^T \mathrm U \mathrm y$$
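In practice you would solve the linear system directly rather than form the inverse. Below is a minimal numpy sketch, again with made-up data, that solves the normal equations and confirms the gradient vanishes at $\hat{\mathrm w}$; it also checks against ordinary least squares on rows rescaled by $\sqrt{u_{ii}}$, the standard reduction of WLS to OLS:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 10, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=(m, 1))
u = rng.uniform(0.5, 2.0, size=m)  # strictly positive weights
U = np.diag(u)

# Solve the normal equations X^T U X w = X^T U y
w_hat = np.linalg.solve(X.T @ U @ X, X.T @ U @ y)

# The gradient 2 X^T U (Xw - y) should vanish at w_hat
grad = 2 * X.T @ U @ (X @ w_hat - y)
print(np.max(np.abs(grad)))  # ~0 up to numerical error

# Equivalent check: ordinary least squares on rows rescaled by sqrt(u_i)
s = np.sqrt(u)[:, None]
w_ols, *_ = np.linalg.lstsq(s * X, s * y, rcond=None)
print(np.max(np.abs(w_hat - w_ols)))  # the two estimates should agree
```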
