[Math] What are the steps to convert weighted sum of squares to matrix form

linear algebra, machine learning, matrix equations, regression

I'm new to converting formulas to matrix form, but this is required for efficient machine learning code, so I want to understand the "right" way, not the cowboy stuff I usually do.

Alright, here we go: I'm trying to convert the weighted sum of squares from the form below into matrix form.
$$J(w)=\sum_{i=1}^m u_i (w^T x_i - y_i)^2$$

Where $u_i$ is the weight for the $i$-th sample error. Also, $x_i \in \mathbb{R}^n$, $w \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, $u_i \in \mathbb{R}$, for $i=1,\dots,m$. Here $w^T x_i$ is the predicted value, the result of multiplying the weight vector by a feature vector.
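For concreteness, here is that summation computed as a plain loop (just a NumPy sketch with made-up data and shapes, so there is a concrete reference to check the matrix form against later):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3                      # made-up sizes for illustration
X = rng.normal(size=(m, n))      # row i is the feature vector x_i
y = rng.normal(size=m)           # labels y_i
u = rng.uniform(size=m)          # per-sample weights u_i
w = rng.normal(size=n)           # weight vector

# J(w) = sum_i u_i * (w^T x_i - y_i)^2, written as an explicit loop
J_loop = sum(u[i] * (w @ X[i] - y[i]) ** 2 for i in range(m))
print(J_loop)
```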

Here's what I think, and I do get creative. So feel free to skip to the end if I go on a tangent.

Let $r$ be the column vector of residuals (the non-squared errors), with entries $r_i = w^T x_i - y_i$. Then the sum of $(w^T x_i - y_i)^2$ over $i=1,\dots,m$ can be written as
$$ r^T r = \begin{bmatrix}r_1 & r_2 & \cdots & r_m\end{bmatrix}
\begin{bmatrix}
r_1 \\
r_2 \\
\vdots \\
r_m \\
\end{bmatrix}
\tag{1}\label{1}$$

The result of multiplying the $1 \times m$ row vector by the $m \times 1$ column vector is a $1 \times 1$ matrix, i.e., a scalar.
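As a quick numerical sanity check of Formula $\ref{1}$ (a sketch with a made-up residual vector):

```python
import numpy as np

r = np.array([1.0, -2.0, 0.5])   # made-up residuals
print(r @ r)                     # row vector times column vector: the scalar r^T r
print(np.sum(r ** 2))            # same value: the sum of squared residuals
```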

Let $u$ be the vector of weights, one per sample error. Since we need to weight the squared errors, we have to incorporate $u$ into Formula $\ref{1}$ before collapsing it to a scalar. Since we want the first $r$ to remain a $1 \times m$ row vector, we define $U$ to be the diagonal matrix whose diagonal entries come from $u$. We now have:

$$ J(w) = \begin{bmatrix}r_1 & r_2 & \cdots & r_m\end{bmatrix}
\begin{bmatrix}
u_1 & 0 & \cdots & 0\\
0 & u_2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & u_m\\
\end{bmatrix}
\begin{bmatrix}
r_1 \\
r_2 \\
\vdots \\
r_m \\
\end{bmatrix}
\tag{2}\label{2}$$

We can simplify this to
$$ J(w) = r^T U r \tag{3}\label{3}$$
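In code, $\ref{3}$ looks like this (again just a sketch with made-up numbers); note that you never actually need to materialize the $m \times m$ matrix $U$, which matters for efficiency:

```python
import numpy as np

r = np.array([1.0, -2.0, 0.5])   # made-up residuals
u = np.array([0.2, 0.5, 0.3])    # made-up weights
U = np.diag(u)                   # diagonal weight matrix

print(r @ U @ r)                 # r^T U r
print(np.sum(u * r ** 2))        # same value, without forming U explicitly
```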

Now we expand $r$. Each $x_i \in \mathbb{R}^n$ is multiplied by $w \in \mathbb{R}^n$, which in stacked form gives $Xw$, where $X$ is the $m \times n$ matrix whose rows are the $x_i^T$ and $w$ is an $n \times 1$ column vector. Let $y$ be the $m \times 1$ column vector of labels $y_i$, $i=1,\dots,m$. Then $r = Xw - y$. Substituting this into $\ref{3}$ gives the final weighted sum of squares in matrix form:
$$
J(w) = (Xw - y)^T U(Xw-y) \tag{4}\label{4}
$$
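Putting it all together, here is a small check that the matrix form $\ref{4}$ matches the original summation (a NumPy sketch with random made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n))      # rows are x_i^T
y = rng.normal(size=m)
u = rng.uniform(size=m)
w = rng.normal(size=n)

U = np.diag(u)
r = X @ w - y

J_matrix = r @ U @ r                                          # (Xw - y)^T U (Xw - y)
J_loop = sum(u[i] * (w @ X[i] - y[i]) ** 2 for i in range(m))
print(np.isclose(J_matrix, J_loop))                           # True
```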

First, does this make sense? Second, and most importantly, is this actually how you're supposed to do it?

Thanks

Best Answer

In general, you are right. If you can specify the weights $u_i$, then you can define the matrix $U^{1/2}=\operatorname{diag}(u_1^{1/2},\dots,u_m^{1/2})$ and multiply your model by it, i.e., $$ U^{1/2}y=U^{1/2}X\beta + U^{1/2}\epsilon. $$ So, you have to minimize $S(\beta)=(U^{1/2}(y-X\beta))'U^{1/2}(y-X\beta)=(y-X\beta)'U(y-X\beta)$. You can denote $y^*=U^{1/2}y$ and $X^*=U^{1/2}X$, and then the OLS result is given by $$ \hat{\beta}=(X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'UX)^{-1}X'Uy. $$

So, you can see that WLS is just OLS performed on a transformed model. As such, you can skip all these steps, define $U=\operatorname{diag}(u_1,\dots,u_m)$, and use the final result directly to get the WLS coefficients.
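If it helps, here is the final result in code (a NumPy sketch on made-up data; in practice you would let a library routine that accepts per-sample weights do this for you, and you would call a solver rather than form the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 3
X = rng.normal(size=(m, n))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=m)
u = rng.uniform(0.5, 2.0, size=m)           # made-up positive weights

# WLS normal equations: beta_hat = (X' U X)^{-1} X' U y
XtU = X.T * u                               # same as X.T @ diag(u), without forming diag(u)
beta_wls = np.linalg.solve(XtU @ X, XtU @ y)

# Equivalent: OLS on the transformed model U^{1/2} y = U^{1/2} X beta + U^{1/2} eps
sqrt_u = np.sqrt(u)
Xs, ys = X * sqrt_u[:, None], y * sqrt_u
beta_ols, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

print(np.allclose(beta_wls, beta_ols))      # True
```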

EDIT:


Note that for a general structure of the covariance matrix, you assume $\epsilon \sim MVN(0, \sigma^2 V)$. In this case you will get a GLS estimator of the form $$ \hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y, $$ so WLS is just the special case of GLS where $V$ is diagonal but still differs from $kI$.

Now, you have to construct $V^{-1}$, which is a positive definite matrix (diagonal in the WLS case) and thus can be factorized as $V^{-1}=V^{-1/2}V^{-1/2}$, so the transformed model is given by $$ y^* = V^{-1/2}y=V^{-1/2}X\beta + V^{-1/2}\epsilon, $$ such that $$ \operatorname{cov}(y^*) = \operatorname{cov}(V^{-1/2}\epsilon)=\sigma^2V^{-1/2}VV^{-1/2}=\sigma^2V^{-1/2}V^{1/2}V^{1/2}V^{-1/2}=\sigma^2I, $$ which solves the problem. So, for the technicalities, just denote the new matrices by $*$ or something like that, and plug them back into the previous form at the end of the calculations.
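In code, the GLS case looks much the same (a sketch on made-up data; here I use the Cholesky factor of $V$ as the "square root" for whitening instead of the symmetric $V^{1/2}$, which yields the same estimator):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 3
X = rng.normal(size=(m, n))
beta_true = np.array([1.0, -2.0, 0.5])

# Made-up positive definite covariance structure V (an AR(1)-style matrix)
rho = 0.6
V = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
L = np.linalg.cholesky(V)                   # V = L L'
y = X @ beta_true + L @ rng.normal(scale=0.1, size=m)

# GLS: beta_hat = (X' V^{-1} X)^{-1} X' V^{-1} y
Vinv = np.linalg.inv(V)                     # fine for a small demo; prefer solves in practice
XtVinv = X.T @ Vinv
beta_gls = np.linalg.solve(XtVinv @ X, XtVinv @ y)

# Equivalent: OLS on the whitened model, using L^{-1} in place of V^{-1/2}
ys = np.linalg.solve(L, y)
Xs = np.linalg.solve(L, X)
beta_ols, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

print(np.allclose(beta_gls, beta_ols))      # True
```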
