[Math] Objective function of linear regression problem with regularization

Tags: linear algebra, linear regression, optimization, regression

We have the following:

  • The design matrix $X \in R^{n \times d}$

  • The output vector $y \in R^n$

  • The weight vector $w \in R^d$

Let $T = \tau I_d$, where $I_d$ is the $d \times d$ identity matrix and $\tau \geq 0$. We now define

$$X' = \begin{bmatrix} X \\ T \end{bmatrix}$$

and

$$y' = \begin{bmatrix} y \\ 0 \end{bmatrix}, \quad 0 \in R^d,$$

with $X' \in R^{(n+d) \times d}$ and $y' \in R^{n+d}$. How does the objective function for the new dataset $(X',y')$ differ from the objective function for the original dataset $(X,y)$? What type of regularization do the new data points impose?

Edit: I see now that by extending the matrix $X$ with the block $T$, which has the scalar $\tau$ in its diagonal elements, we add the terms $(\tau w_j)^2$, $j = 1, \dots, d$, to the objective function $f(w)$ of the regression. This will in fact act as a regularizer that forces the weights close to the origin for large $\tau$, while for small $\tau$ the solution approaches the plain least squares solution. Am I right about this?

Any hints/feedback on my ideas welcome!

Best Answer

The idea in your edit is on the right track: this regularization tries to force the weights to be smaller. Just expand the matrix blockwise. We have
\begin{align}
\min_w \frac{1}{2}\|X'w - y'\|^2 &= \min_w \frac{1}{2}\left\|\begin{bmatrix} Xw - y \\ Tw \end{bmatrix}\right\|^2 \\
&= \min_w \frac{1}{2}\|Xw - y\|^2 + \frac{\tau^2}{2}\|w\|^2,
\end{align}
where going from the first line to the second we used the fact that
$$\left\|\begin{bmatrix}u \\ v\end{bmatrix}\right\|^2 = \|u\|^2 + \|v\|^2$$
(for the 2-norm), together with $\|Tw\|^2 = \tau^2\|w\|^2$, since $T = \tau I_d$.
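If it helps to see this numerically, here is a minimal sketch in NumPy (the variable names and random data are my own, not from the post) checking that plain least squares on the augmented dataset recovers the closed-form ridge solution $(X^\top X + \tau^2 I_d)^{-1} X^\top y$:

```python
import numpy as np

# Minimal check: least squares on the augmented data equals ridge regression.
# All names and data here are illustrative assumptions.
rng = np.random.default_rng(0)
n, d, tau = 50, 5, 0.7

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Build X' by stacking tau * I_d below X, and y' by appending d zeros to y.
X_aug = np.vstack([X, tau * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])

# Ordinary least squares on (X', y') ...
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# ... matches the closed-form ridge solution (X^T X + tau^2 I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + tau**2 * np.eye(d), X.T @ y)

assert np.allclose(w_aug, w_ridge)
```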

So,

  • The first term, $\frac{1}{2}\|Xw - y\|^2$, tries to make the predictions $Xw$ match the observed outputs $y$.

  • The second term, $\frac{\tau^2}{2}\|w\|^2$, tries to make the weights as small as possible.

  • The "regularization parameter", $\tau$, regulates the tradeoff between these two competing goals.
