Solved – What are the steps to convert weighted sum of squares to matrix form

linear-algebra, machine-learning, regression

I'm new to converting formulas to matrix form, but it's required for efficient machine learning code, so I want to understand the "right" way, not the cowboy stuff I usually do.

Alright, here we go. I'm trying to convert the weighted sum of squares below into matrix form. I often see the matrix form stated as equivalent to it, with no explanation of how it is derived.

$$J(w)=\sum_{i=1}^m u_i (w^T x_i - y_i)^2$$

where $u_i$ is the weight for the $i$-th sample error. Also, $x_i \in \mathbb{R}^n$, $w \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, $u_i \in \mathbb{R}$, $i=1,\dots,m$. $w^T x_i$ is the predicted value, the result of multiplying a weight vector by a feature vector.
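
For concreteness, here is a minimal NumPy sketch of this sum computed exactly as written. The arrays `X`, `y`, `u`, `w` and their values are made up purely for illustration:

```python
import numpy as np

# Illustrative data: m = 4 samples, n = 3 features (values are arbitrary)
rng = np.random.default_rng(0)
m, n = 4, 3
X = rng.normal(size=(m, n))          # row i is the feature vector x_i
y = rng.normal(size=m)               # labels y_i
u = rng.uniform(0.5, 2.0, size=m)    # per-sample weights u_i
w = rng.normal(size=n)               # coefficient vector

# J(w) = sum_i u_i * (w^T x_i - y_i)^2, computed term by term
J_loop = sum(u[i] * (w @ X[i] - y[i]) ** 2 for i in range(m))
print(J_loop)
```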

Here's what I think, and I do get creative. So feel free to skip to the end if I go on a tangent.

Let $r$ be the column vector of non-squared errors, with entries $r_i = w^T x_i - y_i$ (each a function of $w$). We can represent the sum of $(w^T x_i - y_i)^2$ over $i=1,\dots,m$ as

$$ r^T r = \begin{bmatrix}r_1 & r_2 & \cdots & r_m\end{bmatrix}
\begin{bmatrix}
r_1 \\
r_2 \\
\vdots \\
r_m \\
\end{bmatrix}
\tag{1}\label{1}$$

The result of multiplying the $1 \times m$ row vector by the $m \times 1$ column vector is a $1 \times 1$ matrix, i.e. a scalar.
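
As a quick numerical sanity check (a sketch, not part of the derivation), the inner product in Formula (1) really does collapse to the plain sum of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(size=5)          # the non-squared errors r_i

# r^T r: a 1 x m row times an m x 1 column gives a scalar
assert np.isclose(r @ r, np.sum(r ** 2))
```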

Let $u$ be a vector of weights, one per sample error. Since we need to weight the squared errors, we need to incorporate $u$ into Formula $\ref{1}$ before the product collapses to a scalar. Since we want the first factor to remain a $1 \times m$ row vector, we define $U$ to be the diagonal matrix whose diagonal entries come from $u$. We now have:

$$ J(w) = \begin{bmatrix}r_1 & r_2 & \cdots & r_m\end{bmatrix}
\begin{bmatrix}
u_1 & 0 & \cdots & 0\\
0 & u_2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & u_m\\
\end{bmatrix}
\begin{bmatrix}
r_1 \\
r_2 \\
\vdots \\
r_m \\
\end{bmatrix}
\tag{2}\label{2}$$

We can simplify this to
$$ J(w) = r^T U r \tag{3}\label{3}$$
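
A small check of Formula (3), with a diagonal $U$ built from $u$ (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
r = rng.normal(size=m)               # non-squared errors r_i
u = rng.uniform(0.5, 2.0, size=m)    # per-sample weights u_i

U = np.diag(u)                       # diagonal weight matrix
assert np.isclose(r @ U @ r,         # r^T U r
                  np.sum(u * r ** 2))  # sum_i u_i * r_i^2
```

In practice you would usually not form the $m \times m$ matrix $U$ explicitly; `np.sum(u * r**2)` gives the same value, and $U$ is mostly useful for writing the algebra.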

Now we expand $r$. Stacking the row vectors $x_i^T$ gives an $m \times n$ matrix $X$, so the $i$-th entry of $Xw$ is $w^T x_i$, where $w$ is an $n \times 1$ column vector. Let $y$ be the $m \times 1$ column vector of labels $y_i$, $i=1,\dots,m$. Now $r = Xw - y$. Substituting this into Formula $\ref{3}$ gives the final weighted sum of squares in matrix form:
$$
J(w) = (Xw - y)^T U (Xw - y) \tag{4}\label{4}
$$
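
And Formula (4) agrees numerically with the original per-sample sum (again just a sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
X = rng.normal(size=(m, n))          # row i is x_i^T
y = rng.normal(size=m)
u = rng.uniform(0.5, 2.0, size=m)
w = rng.normal(size=n)

r = X @ w - y                        # residual vector Xw - y
J_matrix = r @ np.diag(u) @ r        # (Xw - y)^T U (Xw - y)
J_loop = sum(u[i] * (w @ X[i] - y[i]) ** 2 for i in range(m))
assert np.isclose(J_matrix, J_loop)
```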

First, does this make sense? Second, and most importantly, is this actually how you're supposed to do it?

Thanks

Best Answer

I'll venture an answer to this question: Everything you've presented is correct.

What you've basically derived is the setup for the Gauss-Markov theorem: the weighted least squares estimator is the best linear unbiased estimator for weighted data. This estimator minimizes the weighted sum of squares (your first display) and is given by $\hat{\beta}_{WLS} = \left( \mathbf{X}^T\mathbf{W}\mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} Y$. Here $\mathbf{X}$ is the design matrix with its first column set to $\mathbf{1}$, the $m \times 1$ vector of ones (this is the intercept term).
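
As an illustration of that formula (a sketch with simulated data; the setup and variable names are my own, not from the question), the estimator can be computed with a linear solve rather than an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
X = np.column_stack([np.ones(m), rng.normal(size=m)])  # intercept column plus one feature
beta_true = np.array([1.0, 2.0])
u = rng.uniform(0.5, 2.0, size=m)                      # known per-sample weights
y = X @ beta_true + rng.normal(size=m) / np.sqrt(u)    # noisier where the weight is small

W = np.diag(u)
# beta_WLS = (X^T W X)^{-1} X^T W y, via solve() instead of inverting X^T W X
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_wls)   # close to beta_true
```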

This result applies to an arbitrary covariance matrix; for weighted independent data, the covariance is represented by a vector of weights along the diagonal of the weight matrix. (Your notation uses $w$ for the regression coefficients and $u$ for the weights, so to avoid confusion the correspondence is $\mathbf{X} = [x]$, $\mathbf{W} = \text{diag}(u)$, and $\beta = [w]$.)

The proof of the Gauss-Markov theorem is by contradiction. What that means is that the optimality result is not obtained by analytically deriving the estimator directly from the loss function. You may have seen that direct approach used to derive the linear and logistic regression estimating equations.
